I. INTRODUCTION


A. BACKGROUND

In an environment of increasing congressional pressure and decreasing defense
funding, the Department of Defense (DOD) has been investing considerable effort and 
resources into increasing the efficiency and effectiveness of its supply chain 
management. To set a common understanding of that issue, the next section provides a 
summary of the pivotal events that led to the Comprehensive Inventory Management 
Improvement Plan (CIMIP), which aims to reduce excess DOD secondary inventory. 
“DOD defines secondary items as minor end items; replacement, spare, and repair 
components; personnel support and consumable items. Examples of secondary items 
include aircraft, tank, and ship components; construction, medical, and dental supplies; 
and food, clothing, and fuel” (General Accounting Office [GAO], 1988, p. 1). Principal 
inventory items consist of items such as aircraft, vehicles and ships. DOD stratifies 
secondary inventory into four categories: approved acquisition objective, economic 
retention stock, contingency retention stock, and potential reutilization stock. The 
approved acquisition objective stock is calculated in order to meet current requirements, 
while the other three categories are considered by GAO to be in excess of current 
requirements (Government Accountability Office [GAO], 2015b). While not directly 
stated, the DOD appears to only consider potential reutilization stock as excess and seems 
reluctant to dispose of economic and contingency retention stocks due to the potential 
that they will be needed in the future. Figure 1 shows how much of the Navy’s
secondary inventory was considered excess in fiscal years 2004 through 2007.

1. Pre CIMIP 

On September 8, 1982, the U.S. Congress enacted the Federal Managers Financial 
Integrity Act (FMFIA). Primarily an amendment to the Accounting and Auditing Act of 
1950, it required “ongoing evaluations and reports of the adequacy of the systems of 
internal accounting and administrative control of each executive agency” (Federal 
Managers Financial Integrity Act of 1982, 2012). While implementation of the act did not
immediately solve the issues that it intended to address (GAO, 1989), it became a driving
force behind the ongoing efforts to improve the way that the federal government manages 
resources. 

In July 1988, the General Accounting Office, as GAO was known at the time, 
published a report in response to Senate inquiries regarding the growth of secondary item 
inventories within the DOD (GAO, 1988). Between 1980 and 1987, according to the 
report, the value of the DOD’s secondary items grew from $43 billion to $94 billion, 
about $19 billion of which was attributable to the Navy. Of this $51 billion
increase, $27 billion was due to the increasing size of the U.S. military, while $19 billion
was considered to be in excess of requirements and $5 billion was “unstratified,” which 
means that it was not allocated to a specific inventory purpose such as current 
requirement or economic retention. This report contributed to the growing number of 
GAO studies concluding that DOD needed to do a better job of managing its inventory. 

On January 23, 1990, GAO released a letter from the comptroller general (CG) of 
the United States addressed to the chairman of the U.S. Senate committee on 
governmental affairs and the chairman of the U.S. House of Representatives committee 
on government operations (GAO, 1990). In the letter, the CG highlights the need to 
improve the internal controls and financial management systems of the federal 
government. In October 1989, in support of the Office of Management and Budget 
(OMB) identification of “high risk” areas, and after reviewing reports submitted under
FMFIA, the CG identified 14 target areas that would receive special attention from the 
GAO. One of those areas singled out for special review was the DOD inventory 
management systems, due to growing excess inventory levels now valued at over $30 
billion, and numerous other indicators of poor financial management. Since that time, the 
GAO has considered the DOD’s inventory management a high-risk area, and although 
the name of the problem has changed to DOD supply chain management, it remains one 
of the 32 high-risk areas on the GAO’s 2015 list (GAO, 2015a). 

In December 2008, the GAO published a report that evaluated the cost efficiency
of the Navy’s spare parts inventory. In explaining why the Navy had accumulated excess
secondary inventory, the report concluded, “much of the inventory that exceeded current
requirements or had inventory deficits resulted from inaccurate demand forecasts” (GAO,
2008, p. 34). The report also documented the results from surveys of the Navy’s Item
Managers (IM), who identified many additional factors that they felt were contributing to
inventory excesses and deficits (GAO, 2008). From 2004 to 2007, GAO calculated that
secondary inventory in excess of current requirements averaged about 40%, or $7.5
billion, of total Navy inventory. Figure 1, taken from the report, shows this trend in 2007
dollars. This report was the second in a series of GAO reports that reviewed the
secondary inventory management of the Air Force (GAO, 2007), Army (GAO, 2009) and 
Defense Logistics Agency (DLA) (GAO, 2010). To varying degrees, each of these 
reports commented on the need for improved demand forecasting. Subsequently, GAO 
concluded that “inaccurate demand forecasting is the leading reason for the accumulation 
of excess inventory” (GAO, 2011, p. 11) throughout the services and DLA. 

Figure 1. Navy Secondary Inventory Meeting and Exceeding Requirements
(FY 2004-2007). Source: GAO (2008).

[Figure not reproduced. Vertical axis: dollars (in billions); horizontal axis: fiscal year;
series: beyond current requirements and current requirements.]

After 20 years of effort with little improvement, Congress inserted language into
the fiscal year (FY) 2010 National Defense Authorization Act (NDAA) that required the
development of an extensive plan that would improve the inventory management 
practices within the DOD. When the NDAA was enacted on October 28, 2009, section 
328 required that this plan be provided to Congress for review within 270 days. The plan 
was required to address eight separate elements intended to improve “the inventory 
management systems of the military departments and the Defense Logistics Agency with 
the objective of reducing the acquisition and storage of secondary inventory that is excess 
to requirements” (NDAA, 2009, para [a]). The most relevant aspect to this research is the 
second part of the first element, which required the “development of metrics to identify 
bias toward over-forecasting and adjust forecasting methods accordingly” (NDAA, 2009, 
para [b(1)]). This legal requirement would eventually result in the DOD developing a
common metric for forecast accuracy and forecast bias that would measure the 
performance of each military service and DLA. 

2. CIMIP

As required by the FY10 NDAA section 328, the Assistant Secretary of Defense
for Logistics and Materiel Readiness published the DOD’s Comprehensive Inventory 
Management Improvement Plan in October 2010. In addition to fulfilling the demands of 
Congress, the objective of the plan was to drive “a prudent reduction in current inventory 
excesses as well as a reduction in the potential for future excesses without degrading 
materiel support to the customer” (Assistant Secretary of Defense for Logistics and
Materiel Readiness (ASD[L&MR]), 2010, p. iii). In that document, chapter one contains
an overview of inventory management improvement, assigns responsibilities and
highlights the implementation strategy. Chapters two through nine detail the eight
sub-plans that have been developed to address the eight elements required by section 328,
while chapter ten details four additional improvement actions that the DOD is developing
on its own initiative. Although these department-wide actions were not specifically
required by section 328, they were included in the plan because “these actions support the
Department’s intent to improve DOD inventory management and reduce excesses”
(ASD[L&MR], 2010, p. 10-1). Appendix A lists 17 other DOD strategies, plans, or
efforts that are consistent with the CIMIP overall objective of reducing secondary item
inventory levels. Most importantly, Appendix A highlights that the plan is consistent with
the objectives of the DOD Logistics Strategic Plan, which “identifies high level goals,
performance measures, and key initiatives that support the DOD priorities and drive the
logistics enterprise improvements” (ASD[L&MR], 2010, p. A-1). Appendix B lists the 12
GAO reports published between March 2006 and May 2010 that are related to secondary
item inventory, summarizes their findings, and briefly states how the plan will address
each finding. Appendix C reprints the entirety of section 328 of the FY10 NDAA, while
Appendix D provides a list of abbreviations.

While the plan is a comprehensive approach to improving materiel management,
only Chapter II, Sub-Plan A: Demand Forecasting, is relevant to our research. The overall
objective of sub-plan A “is to improve the prediction of future demands so that inventory
requirements more accurately reflect actual needs” (ASD[L&MR], 2010, p. 2-3). In order
to accomplish this objective, the DOD did a thorough review of current forecasting
procedures and methodologies in search of ways to improve the process. As a result of
this review, the DOD established five action items that required further work to address
the issues with demand forecasting. Of these five action items, Action A-2: Implement
Standard Metrics to Assess Forecasting Accuracy and Bias is the basis for this research
project. DOD targeted the end of fiscal year 2011 to identify these two metrics and the
end of fiscal year 2012 to establish the processes by which the DOD components could
set targets and begin utilizing the common metrics. The accuracy metric is intended to
measure forecast performance while minimizing bias and generating results for various
inventory segments. The bias metric is intended to identify over- and under-forecasts in
order to prevent inventory excesses and deficits.

3. Post CIMIP 

In January 2011, GAO published its required 60-day assessment of the DOD’s 
plan to meet the eight elements identified in section 328 (GAO, 2011). While GAO
concluded that the plan did address all eight elements from section 328 of the FY10
NDAA, the report identified five general areas that could produce implementation
challenges if not managed properly. One of the examples the report used to highlight
potential friction areas was the requirement to develop a standard accuracy metric and
performance targets. GAO felt that this level of standardization could be difficult to reach
given the fact that the services and DLA had different approaches to measuring demand
forecast accuracy (GAO, 2011, p. 6). In May 2012, GAO fulfilled its final requirement
from section 328 by publishing its 18-month assessment of how effectively the
services and DLA had implemented the plan they developed. GAO concluded that
the DOD was “making progress towards...establishing a department-wide set of
standardized metrics for inventory management. Moving forward, DOD’s inventory
management improvement efforts would benefit from challenging, but achievable targets
for reducing its on-order and on-hand excess inventory” (GAO, 2012, p. 30). Within the
demand-forecasting sub-plan, GAO determined that while DOD had successfully
developed the forecast accuracy and bias metrics, the effective implementation of these
metrics still required a sustained effort to meet the expected completion date of
September 2012. The accuracy metric that was developed is an absolute error metric,
while the bias metric is a signed error metric. The formulas for these two metrics are
discussed further in Chapter II and are shown in Equations (2.24) and (2.25).
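
The distinction between an absolute error metric and a signed error metric can be illustrated with a minimal sketch. This is not the CIMIP formulation, which is given later in Equations (2.24) and (2.25); the function name, the sample numbers, and the sign convention (forecast minus actual, so that a positive value means over-forecasting) are assumptions made only for illustration.

```python
def absolute_and_signed_error(forecasts, actuals):
    """Return (total absolute error, total signed error) for paired values.

    Assumed sign convention: error = forecast - actual, so a positive
    signed total indicates over-forecasting and a negative one indicates
    under-forecasting.
    """
    if len(forecasts) != len(actuals):
        raise ValueError("forecasts and actuals must have the same length")
    errors = [f - a for f, a in zip(forecasts, actuals)]
    total_absolute = sum(abs(e) for e in errors)  # magnitude of misses
    total_signed = sum(errors)                    # direction of misses
    return total_absolute, total_signed


# Hypothetical data: every forecast misses, but the misses offset each other.
forecasts = [10, 10, 10, 10]
actuals = [14, 6, 12, 8]
abs_err, signed_err = absolute_and_signed_error(forecasts, actuals)
print(abs_err, signed_err)
```

In this hypothetical case the signed total is zero even though every individual forecast is wrong, which is why a bias measure alone cannot stand in for an accuracy measure.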

Reinforcing CIMIP efforts, the acting Under Secretary of Defense for
Acquisition, Technology, and Logistics (USD[AT&L]) signed DOD Instruction 4140.01
in December 2011, establishing that DOD’s supply chain materiel management “shall
operate as a high-performing and agile supply chain responsive to customer requirements
during peacetime and war while balancing risk and total cost” (Kendall, 2011). In
addition to clearly defining policy and assigning responsibility for management of
materiel across the DOD supply chain, this instruction laid out the framework for 11
DOD Supply Chain Materiel Management Procedures manuals. In February 2014, the 11
manuals were published as volumes 1 through 11 of DOD Manual 4140.01, with each
covering specific supply chain procedures. Volume 2, Demand and Supply Planning,
provided guidance on, among other things, how DOD components should forecast
customer demand. Volume 10, Metrics and Inventory Stratification Reporting, required,
among other things, that the DOD utilize metrics that were specific, measurable,
actionable, realistic, and timely, and it gave demand forecast accuracy as an example
of such a metric.

In April 2015, GAO released its most recent report related to defense inventory
management, concluding that the services had generally been able to reduce their excess
inventory, which was the primary objective of section 328 of the FY10 NDAA. Although
this result was positive, GAO made seven recommendations to improve how DOD
managed inventory. While GAO recommended that DOD establish goals for its demand
forecasting metrics, DOD wanted to collect more data to establish a performance baseline
before setting any department-wide goals (GAO, 2015b, p. 43). The report also reviewed
results from the first and second metrics reporting periods. These metric results are reported
semi-annually for the preceding 12-month period, so the first period covered all 12
months of FY13, ending in September 2013. The second period covered the last six
months of FY13 and the first six months of FY14, ending in March 2014. Figures 2 and
3 show the results reported by three services during these two 12-month
reporting periods. The figures do not include the results for DLA or the non-aviation
material for the Marine Corps. The Marine Corps aviation material is included in the
Navy results.

Figure 2. Demand Forecast Accuracy Performance by Service. Source: GAO
(2015).

The Air Force reported the highest forecast accuracy for these periods. The Army showed
the greatest improvement over the two reporting periods.

Figure 3. Demand Forecast Bias by Service. Source: GAO (2015).

The Army had the largest bias for over-forecasting demand, followed by the Navy and the
Air Force. In the second reporting period, the Air Force reported a negative bias, which
indicates that they were under-forecasting their demand.

In response to the Navy’s relatively poor performance in both forecast accuracy
and bias, Naval Supply Systems Command (NAVSUP) reported that they were 
“reviewing and analyzing their demand forecasting processes and planning factors to 
improve performance on DOD’s forecast accuracy and bias metrics tracked across the 
department” (GAO, 2015b, p. 46). 

B. DATA DESCRIPTION AND RECENT RESULTS

The business rules for calculating the DOD’s demand forecasting accuracy and 
bias metrics that were provided to DLA and each of the services specify eight forecast 
data elements and two demand history data elements that should be included in their data 
captures (DOD, 2013). These elements were:

Forecast Data Elements 

• NIIN / family head / subgroup master

• Demand forecast (monthly/quarterly/semi-annually) 

• Latest acquisition cost or moving average cost 

• Reparable/consumable indicator 

• Unit of issue 

• Unit of measure 

• Time frame of the forecast (start date) 

• Date the forecast was made (forecast date) 

Demand History Data Elements 

• Actual demand 

• Timeframe of demand 
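
To make the shape of such a data capture concrete, the elements listed above can be sketched as two simple record types. This is only an illustrative layout and not the schema used by DOD or NAVSUP: all class names, field names, types, and sample values are assumptions. The forecast interval is held in a separate field, and an item identifier is added to the demand record so that the two records could be matched.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ForecastRecord:
    """One forecast line, mirroring the eight forecast data elements."""
    item_id: str            # NIIN / family head / subgroup master
    demand_forecast: float  # forecasted quantity for the interval
    forecast_interval: str  # "monthly", "quarterly", or "semi-annually"
    unit_cost: float        # latest acquisition cost or moving average cost
    is_reparable: bool      # reparable/consumable indicator
    unit_of_issue: str
    unit_of_measure: str
    start_date: date        # start of the time frame being forecasted
    forecast_date: date     # date the forecast was made

    def dollar_value(self) -> float:
        """Forecasted quantity extended by unit cost."""
        return self.demand_forecast * self.unit_cost

    def lead_days(self) -> int:
        """Days between making the forecast and the start of its time frame."""
        return (self.start_date - self.forecast_date).days


@dataclass(frozen=True)
class DemandRecord:
    """One demand history line, mirroring the two demand history elements."""
    item_id: str            # assumed join key, not one of the listed elements
    actual_demand: float
    period_start: date      # time frame of demand
    period_end: date


# Hypothetical example values, for illustration only.
record = ForecastRecord(
    item_id="000000000",
    demand_forecast=12.0,
    forecast_interval="quarterly",
    unit_cost=250.0,
    is_reparable=False,
    unit_of_issue="EA",
    unit_of_measure="EA",
    start_date=date(2013, 10, 1),
    forecast_date=date(2013, 7, 1),
)
print(record.dollar_value(), record.lead_days())
```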

NAVSUP Weapon Systems Support (WSS) provided CIMIP-compliant data for
fiscal years 2013, 2014 and 2015. The raw data elements provided the National Item
Identification Number (NIIN), quarterly demand forecast, repair indicator, stock routing
code, replacement cost, acquisition advice code, performance based logistics indicator,
family group code, unit of measure, life cycle indicator (LCI), cognizance code, actual
annual demand, and annual naive forecast. The FY14 and FY15 data also included
calculated elements such as annual demand forecast, total dollar calculations, absolute
and signed errors, and line item forecast accuracy and bias metrics. In addition, the FY15
data included a bar graph of the Navy’s overall CIMIP results reported to DOD for the five
previous 12-month evaluation periods (Figure 4).


Figure 4. Navy CIMIP Forecast Metric Results FY13-FY15. Source:
NAVSUP (2015).

[Figure not reproduced. Chart title: Navy Forecasted Items.]

Accuracy and bias results are reported to DOD semi-annually for the preceding 12-month
period, which creates a six-month overlap in the data. The accuracy result is an absolute
error metric that summarizes the Navy’s forecasting performance. The bias result is a
signed error metric that represents the degree of over-forecasting.
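
The six-month overlap follows directly from reporting a 12-month window every six months. The sketch below reproduces the first two reporting periods described earlier (all of FY13, then the last half of FY13 plus the first half of FY14). The function names are hypothetical, and the windows are represented as first-of-month start dates with exclusive end dates purely for convenience.

```python
from datetime import date


def add_months(d, months):
    """Shift a first-of-month date by a whole number of months."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def reporting_windows(first_start, count, step_months=6, length_months=12):
    """Return (start, end_exclusive) pairs for rolling reporting windows."""
    windows = []
    for i in range(count):
        start = add_months(first_start, i * step_months)
        windows.append((start, add_months(start, length_months)))
    return windows


# FY13 begins on 1 October 2012, so the first window covers all of FY13.
windows = reporting_windows(date(2012, 10, 1), 2)

# Months shared by the two consecutive windows.
overlap_start = max(windows[0][0], windows[1][0])
overlap_end = min(windows[0][1], windows[1][1])
overlap_months = (overlap_end.year - overlap_start.year) * 12 + (
    overlap_end.month - overlap_start.month
)
print(windows, overlap_months)
```

Because half of each window is shared with its predecessor, consecutive results are not independent observations, which should be kept in mind when reading period-to-period changes.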


C. PURPOSE AND BENEFITS OF STUDY 

This research effort intends to review the validity of the DOD’s newly 
implemented CIMIP forecasting metrics and identify weaknesses that may not be 
apparent to the casual observer of forecast accuracy metrics. We also intend to provide 
recommendations that will improve the DOD’s efforts to increase forecast accuracy, 
which should result in better forecasts in the future, decreasing levels of excess inventory 
and ultimately, substantial cost savings to the DOD. While we certainly appreciate the 
complexity of forecasting future demand and accurately measuring those forecasts, and
recognize the amount of effort that has already been devoted to this issue, we will
demonstrate that our research can provide value to these DOD efforts. Even if the DOD 
disregards our recommendations, there are still opportunities for the Navy, or the other 
services, to implement our recommendations and improve their demand forecasting 
efforts. 

D. RESEARCH QUESTIONS

In 2011, GAO declared that “inaccurate demand forecasting is the leading reason 
for the accumulation of excess inventory” (p. 11), and as Figure 5 shows, the Navy has 
been making steady progress in reducing its on-hand excess inventory; however, despite 
this good news, the CIMIP forecast results have not significantly changed (Figure 4). 


Figure 5. Navy On-Hand Excess Inventory, Sept. 2012 to Mar. 2014.
Source: GAO (2015).

The vertical bars represent excess inventory as a percentage of total inventory. The bottom
table shows inventory dollar values in billions. While total inventory value has remained
constant, whether or not contractor-managed inventory is excluded, excess inventory has
been decreasing in real dollar values and as a percentage of total inventory.

Although many factors contribute to excess inventory levels, if the “leading
reason for the accumulation of excess inventory” (GAO, 2011, p. 11), namely forecast
accuracy, is not improving or is getting worse while excess inventory is declining, then
this raises the question of whether demand forecasting is actually the largest contributor
or, alternatively, whether forecasting performance is being accurately measured. Our
intuition is that the answer lies in the second explanation, and we intend to show it by
addressing the following questions:

• Does the CIMIP forecasting metric capture forecast error in a way that is
actionable?

• Are the CIMIP forecasting accuracy results impacted by variables or data
set characteristics that are not directly related to forecast error?

• Does the CIMIP forecasting metric provide a useful product to the
forecasters that enables them to prioritize their forecast improvement
efforts?

• Is there an alternative forecast accuracy equation that enables the
aggregation of accuracy results for multiple line items, with various units
of measure, while also providing actionable results at the item level?

Finally, it is also important to investigate how forecast accuracy measurement can
generate valuable information for the Navy’s managers.

E. SCOPE, ORGANIZATION AND METHODOLOGY 

While inventory management improvement efforts span a large range of topics
detailed in the 2010 CIMIP, this research will focus primarily on one line of effort to
improve demand forecasting: the measurement of forecast accuracy. Chapter II is a
summary of the traditional academic accuracy metrics and a compilation of the most
valuable findings in the existing literature. Chapter III aims to present an in-depth
analysis of the CIMIP equation and a comparison to an alternative accuracy metric, Mean
Absolute Scaled Error (MASE). Those analyses are composed of specific tests to uncover
the existence of inherent flaws or undesirable characteristics in the current metric. In
order to compare the accuracy metrics, we assess them utilizing four desirable
characteristics. The tests we will conduct utilize three different methods, according to
specific purposes. The first method uses fictional numbers, the second uses real numbers
extracted from available data, and the third generates Monte Carlo simulations using the
Crystal Ball program.

Although we did not intend to make this a discussion on forecasting methods, the
interrelatedness of forecasting methods and results measurement makes it unavoidable.
Therefore, in Chapter IV, we analyze alternative ways to generate more accurate demand
forecasts. In Chapter V, we summarize the most important findings, make
recommendations for DOD and Navy, and propose future areas of research to continue to
advance the effectiveness of DOD and Navy forecasting efforts.

II. LITERATURE REVIEW


A. INTRODUCTION

This chapter presents a review of the evolutionary path of knowledge in the field 
of forecast accuracy, while also providing an overview of the most popular forecast
accuracy measures.

B. FORECAST ACCURACY

Demand forecasts are a key component of effective inventory management.
Delivery of inputs to production takes time and, even in a deterministic
scenario, managers need to be precise in determining the correct time to transmit their
orders to suppliers in order to avoid the costs of shortages or of holding excess inventory.

Reinforcing that idea, Makridakis and Hibon (2000) claim that “forecasting
accuracy is a critical factor for, among other things, reducing costs and providing better
customer service” (p. 451). The effects of an inaccurate prediction are intensified when
variability is present, which makes forecast accuracy even more important.

1. History of Forecast Accuracy Measurement 

Over the last 50 years, researchers have invested considerable time and effort to
increase the understanding of forecast accuracy. While there is not a consensus about the
first academic article on forecast accuracy, Ferber (1956) and Schupack (1962) are
considered pioneers in this field. They tested multiple forecasting methods, using a
correlation index and various accuracy metrics to determine whether forecast models that
demonstrated a good fit to past data could then generate good forecasts. The results did
not support this hypothesis and they concluded that best fit on past data is not a good
measure of forecast accuracy. Moreover, forecast method rankings do not change much
by using different forecast accuracy metrics and there is no absolute best forecast method.

As computer processing capabilities grew, researchers could proceed with broader
studies to measure the accuracy of different forecast methods. Fildes and Makridakis
(1995) found that in the 20 years from 1971-1991, approximately 130 articles per year
were published in the Journal of the American Statistical Association (JASA) related to
time series analysis. For example, Newbold and Granger (1974) used average squared
forecast errors to assess the accuracy of three forecast methods, each one applied to 106
time series. A few years later, Nelson and Granger (1979) were able to analyze five
forecast methods through twenty-one time series, utilizing ten forecast horizons and two
different accuracy metrics.

Newbold and Granger (1974) were able to make insightful conclusions regarding 
the use of non-automated forecast methods. They found that the Box-Jenkins forecast 
method was capable of making up for its significantly longer calculating time by 
producing more accurate estimations. Moreover, results from that forecasting method 
could be further improved by combining them with other fully automated procedures, 
like Holt-Winters or a stepwise autoregressive forecast. They also provided guidelines to 
optimize the choice of forecast methods according to the length of the time series. The 
idea of combining forecast methods in order to increase accuracy is one of the most 
valuable contributions in the field of forecasting and was first investigated by Reid 
(1968) and was further discussed by Nelson (1972), Cooper and Nelson (1975), 
Makridakis and Winkler (1983), Nelson (1984), Clemen (1989), Fildes (1989), among 
many others. 

By the late 1970s, the question of what is the best forecast method seemed to be 
far from a solid answer. Utilizing the increasing power of computing capabilities and 
availability of new knowledge in the field of time series, Makridakis et al. (1979) and 
Makridakis et al. (1982) conducted accuracy analysis on a much greater number of
forecast methods. 

Moreover, Makridakis et al. (1982) was the first empirical study of what became 
known as the M-1 Competition, which began the M-series Competitions. Makridakis et 
al. (1993) and Makridakis and Hibon (2000) published the M-2 and M-3 Competitions, 
respectively, which attempted to uncover situations in which one forecast method is 
expected to outperform others.

The M-1 Competition in 1982 was based on the consensus of nine authors and made
important contributions to the literature. It analyzed 24 different forecast methods using
1,001 time series and five accuracy metrics: Mean Absolute Percentage Error (MAPE),
Mean Squared Error (MSE), Average Ranking (AR), Median of Absolute Percentage
Errors (MdAPE), and Percentage Better (PB). The major findings of the M-1 Competition
are that no forecast method is capable of minimizing forecast errors across all kinds of
demand patterns; more complex forecast methods do not always outperform rudimentary
ones; and the best technique changes from one forecast horizon to the next and when
different measures of accuracy are used.
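
For readers less familiar with these measures, three of them can be sketched in their common textbook forms. These are generic definitions, not the exact computations used in the M-1 Competition, and the function names and sample data are assumptions for illustration. AR and PB are omitted because they compare several forecast methods against each other rather than scoring a single forecast.

```python
from statistics import mean, median


def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero (division by zero).
    """
    return 100 * mean(abs(a - f) / abs(a) for a, f in zip(actuals, forecasts))


def mse(actuals, forecasts):
    """Mean squared error, in squared units of the data."""
    return mean((a - f) ** 2 for a, f in zip(actuals, forecasts))


def mdape(actuals, forecasts):
    """Median absolute percentage error, in percent."""
    return 100 * median(abs(a - f) / abs(a) for a, f in zip(actuals, forecasts))


# Hypothetical demand and forecast values.
actuals = [100, 200, 50, 400]
forecasts = [110, 180, 60, 400]
print(mape(actuals, forecasts), mse(actuals, forecasts), mdape(actuals, forecasts))
```

The percentage-based measures fail whenever actual demand is zero, which is one reason the choice of metric matters for items with intermittent demand.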

The M-1 Competition also developed the categorization of time series in order to
allow for the possibility that one technique performs better when specific circumstances
are present. That method is in accordance with Gilchrist (1979), who affirmed that
averaging accuracy measures for several time series might hide the ability of a forecast
method to deal with one specific type of time series better than others. However, one may
infer that the way the time series were then grouped may have influenced the results.

Those findings were criticized by Armstrong and Lusk (1983), who identified the
lack of interpretation or discussion about the results as an opportunity to open a 
discussion among experts aiming to clarify important aspects of forecast accuracy. 

In order to address criticisms related to the organization of results, the M-2 Competition
in 1993 made a simpler analysis, evaluating 16 forecast methods, each one applied to 29
time series, using just one accuracy metric, MAPE. It concluded in favor of both the
exponential smoothing and the Dampen and Single smoothing methods, considered as
being among the simplest. It also found that relatively sophisticated forecast methods are
expected to perform better when randomness of series is small.

The M-3 Competition in 2000 moved back to extensive analysis: as many
as 3,000 time series were used to generate forecasts, using 24 different methods, whose
accuracy was measured by five metrics: MAPE, AR, Median Symmetric Absolute
Percentage Error (MdSAPE), PB and Median of Relative Absolute Error (MdRAE). It
rejected the argument that more complex methods outperform simpler ones. It found that
the best method varies according to the accuracy metric used and that a combination of
forecast methods is able to increase forecast accuracy.

Armstrong and Collopy (1992) presented a different approach to the use of
forecast accuracy, evaluating measures of forecast accuracy, instead of the forecast
methods themselves, by using 191 economic time series. They provided a new approach
to judging accuracy metrics, using a framework composed of reliability, construct
validity, sensitivity to small changes, protection against outliers, and relationship to
decision making. Their final conclusions were favorable to the use of MdRAE as an
accuracy metric.

Following that discussion, Hyndman and Koehler (2006) provide a 
comprehensive critical survey of accuracy measures to uncover significant inadequacy in 
all of them. They sort the accuracy metrics into five categories: scale-dependent 
measures, measures based on percentage errors, measures based on relative errors, 
relative measures and scaled errors; describe each category and provide critical analysis 
of their weaknesses. Acknowledging inherent flaws of the existing accuracy metrics, they 
propose MASE. The metric was retroactively applied to the M-3 Competition data to test 
its potential. 

The most important findings were that MASE can be used in all patterns of
demand, that it produced results in accordance with what was found by Makridakis and
Hibon (2000) about best-performing methods, and that MASE represented a more
powerful test than the other metrics, since its results show more significant differences
between forecast methods.
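
A minimal sketch of MASE for a non-seasonal series is given below, following the general idea in Hyndman and Koehler (2006): forecast errors are scaled by the in-sample mean absolute error of a one-step naive forecast. The function name, argument names, and sample data are assumptions for illustration, and the sketch does not cover seasonal scaling.

```python
def mase(training_actuals, test_actuals, test_forecasts):
    """Mean absolute scaled error for a non-seasonal series.

    The scale is the mean absolute error of the one-step naive forecast
    (previous value) over the training data.
    """
    if len(training_actuals) < 2:
        raise ValueError("need at least two training observations")
    if len(test_actuals) != len(test_forecasts):
        raise ValueError("test actuals and forecasts must have the same length")

    naive_errors = [
        abs(training_actuals[i] - training_actuals[i - 1])
        for i in range(1, len(training_actuals))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("training series is constant, so the scale is zero")

    abs_errors = [abs(a - f) for a, f in zip(test_actuals, test_forecasts)]
    return (sum(abs_errors) / len(abs_errors)) / scale


# Hypothetical series: a value of 1.0 means the forecast performed exactly
# as well as the in-sample naive forecast; values below 1.0 are better.
print(mase([10, 12, 9, 11, 10], [12, 8], [10, 10]))
```

Because the result is a ratio, it carries no units, which is what allows results for items with different units of measure to be compared or averaged.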

Finally, after considering the existing literature, Fildes et al. (2008) claim that 
“establishing an appropriate measure of forecast error remains an important practical 
problem for company forecasting”. 

2. Traditional Academic Measures of Forecast Accuracy

A starting point to discuss forecast accuracy measurement is that it is based on
observation of errors. Those errors are comparisons between the demand that was
forecasted for a given period of time and the actual observation during that same time
period. Therefore, the most basic idea about forecast accuracy is that a better forecast
method is expected to produce smaller errors.

Furthermore, forecast accuracy can be considered a two-dimensional problem. 
Some may think in terms of measuring accuracy over many periods of time for one item, 
while others may need a number that represents the goodness of a forecast method for 
many items in the same time period. Table 1 exemplifies the generation of forecast 
accuracy values in both dimensions mentioned. 


Table 1. The Two Dimensions of Forecast Accuracy 

[Table body not recoverable from the source. Surviving values: two item-level Mean of 
Absolute Errors values, 2.67 and 3.33; Mean of Absolute Errors in time 1 = 4, in 
time 2 = 2.75, and in time 3 = 3.25.] 


Mean of Absolute Errors is one of the existing forecast accuracy metrics. It can be 
calculated either at the line item level or at the aggregated level, for each period. In this 
case, the forecast method used performed better for items 1 and 4, while period 2 was the 
time in which the overall forecast accuracy was the best. Considering the scale 
dependency of that metric, discussed in Chapter II, this hypothetical data set assumes 
that all line items have the same unit. 
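
The two dimensions can be sketched in a few lines of Python. The absolute errors 
below are hypothetical and are not the data behind Table 1; the variable names are 
illustrative assumptions only. 

```python
from statistics import mean

# Hypothetical absolute errors |f - a|: one row per line item, one column per period.
# All items are assumed to share the same unit, as in the note to Table 1.
abs_errors = [
    [2, 1, 3],  # item 1
    [5, 4, 3],  # item 2
    [6, 2, 4],  # item 3
    [1, 3, 2],  # item 4
]

# Dimension 1: one item across many periods (row-wise mean).
mae_per_item = [mean(row) for row in abs_errors]

# Dimension 2: many items within one period (column-wise mean).
mae_per_period = [mean(column) for column in zip(*abs_errors)]

print("MAE per item:  ", mae_per_item)
print("MAE per period:", mae_per_period)
```

Reading the rows answers how well one item was forecasted over time, while reading 
the columns answers how well all items were forecasted in a given period. 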


First, it is possible to isolate one time series, for example, the repeated demand for 
one item, and compute the accuracy over time, which Hyndman and Athanasopoulos 
(2014) describe as a type of time series cross-validation. Fildes et al. (2008) 
reinforce the importance of this process by claiming that forecasters should measure 
accuracy as a result of sequential errors. 

One particular way to conduct such analysis is to calculate errors, for specific 
times, by comparing one-period forecasts and actual values. Afterwards, there is a variety 
of ways to combine errors and produce significant information about the accuracy of the 
forecast for that specific item. 

However, Fildes et al. (2008) point out that “a common requirement, within an 
organization, is to provide a one-figure summary error measure, for many different time 
series” (p. 1158). That procedure is also known in the literature as aggregation, which is 
both criticized and defended by many studies, such as Jenkins (1982), Fildes and 
Makridakis (1995) and Hyndman and Koehler (2006). 

In order to enable aggregation, Fildes and Makridakis (1995) affirm that errors 
must be standardized. In fact, Hyndman and Koehler (2006) applied scaled errors as a 
form of standardization, thus enabling aggregation by simple average. 

Therefore, we infer that an effective measure of accuracy should be able to 
produce results for both dimensions. However, as we could not find any further 
discussion about the best way to aggregate accuracy values, hereafter, we are going to 
discuss a variety of metrics used to calculate forecast accuracy across time, which are 
exhaustively discussed in literature and often used by organizations. 

To do so, we are going to present the most common accuracy metrics using the 
same taxonomy found in Hyndman and Koehler (2006). Basically, we review the many 
possibilities of handling the error, which is calculated as: 

$e_t = f_t - a_t$    (2.1) 

where: 

$e_t$ = forecast error at a given time 

$f_t$ = forecast value at a given time 

$a_t$ = actual value at a given time 

a. Scale-Dependent Metrics 

Metrics that fall in this category generate values accompanied by their respective 
units. Their use has to be restricted to series cross-validation in order to avoid the 
problem of mixing units of different items. That is the main source of criticism of the 
M-1 Competition, in Makridakis et al. (1982), since it inappropriately uses the MSE 
across time series. 

The most common scale-dependent measures are: 

Mean Squared Error 

$\mathrm{MSE} = \mathrm{mean}(e_t^2)$    (2.2) 

Root Mean Squared Error 

$\mathrm{RMSE} = \sqrt{\mathrm{mean}(e_t^2)}$    (2.3) 

Mean Absolute Error 

$\mathrm{MAE} = \mathrm{mean}(|e_t|)$    (2.4) 

Median Absolute Error 

$\mathrm{MdAE} = \mathrm{median}(|e_t|)$    (2.5) 

All equations in this category use central tendency measures. It is worth noting 
that means and medians are at opposite extremes in terms of sensitivity to outliers. 
Hence, large errors will dominate the results in formulas based on means and cause 
almost no change in the results of formulas based on medians. Therefore, in both cases 
the quality of the results is harmed. 

Additionally, measures that use squared errors have the potential to penalize large 
deviations, in comparison to small ones, which makes them appear attractive to some 
managers. However, their use was tested and not recommended by Armstrong and 
Collopy (1992) and Armstrong (2001), due to the disproportionate harm caused by 
outliers. 
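
The contrast between mean-based and median-based measures can be illustrated with a 
minimal Python sketch of Equations (2.2) through (2.5). The function name and the 
demand values are hypothetical, and the error follows the convention of forecast minus 
actual. 

```python
from math import sqrt
from statistics import mean, median


def scale_dependent_metrics(forecasts, actuals):
    """Return MSE, RMSE, MAE and MdAE for a single time series."""
    errors = [f - a for f, a in zip(forecasts, actuals)]  # forecast minus actual
    mse = mean(e ** 2 for e in errors)
    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": mean(abs(e) for e in errors),
        "MdAE": median(abs(e) for e in errors),
    }


actuals = [10, 12, 11, 13, 12]
# Every forecast misses by exactly one unit.
clean = scale_dependent_metrics([11, 11, 12, 12, 13], actuals)
# Same series, but the last forecast is a large outlier (hypothetical value).
outlier = scale_dependent_metrics([11, 11, 12, 12, 52], actuals)

print(clean)
print(outlier)
```

In this example a single outlier raises the MSE from 1 to about 321 and the MAE from 1 
to 8.8, while the MdAE stays at 1, which shows both the dominance of large errors in 
mean-based formulas and their near invisibility in median-based ones. 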

b. Percentage Errors Metrics 

Hyndman and Koehler (2006) define percentage error ($p_t$) by the following 
equation: 

$p_t = 100 \, e_t / a_t$    (2.6) 

Means, medians and squares are applied to $p_t$ to derive new forecast accuracy 
metrics. The most common percentage error measures found in the literature are: 

Mean of Absolute Percentage Error 

$\mathrm{MAPE} = \mathrm{mean}(|p_t|)$    (2.7) 

Median of Absolute Percentage Error 

$\mathrm{MdAPE} = \mathrm{median}(|p_t|)$    (2.8) 

Root Mean Square Percentage Error 

$\mathrm{RMSPE} = \sqrt{\mathrm{mean}(p_t^2)}$    (2.9) 

Root Median Square Percentage Error 

$\mathrm{RMdSPE} = \sqrt{\mathrm{median}(p_t^2)}$    (2.10) 

An inherent flaw of the percentage error ($p_t$) is that it produces an infinite result 
when $a_t = 0$. Therefore, none of these metrics is recommended for data sets that 
contain actual demand values equal to zero. 

Additionally, Tayman and Swanson (1999) state that “MAPE does not meet the 
criterion of validity, as it systematically overstates the average error of estimates, 
therefore, harming the degree of correspondence between its measures and actual values” 
(p. 299). 

Furthermore, Makridakis et al. (1993) noticed that these metrics also penalize 
positive and negative errors differently because negative errors ($e_t < 0$), in terms of 
inventory, are limited to the amount of the actual value ($a_t$), while positive errors 
($e_t > 0$) are unbounded. In order to deal with that, they defined symmetric measures: 

Symmetric Mean Absolute Percentage Error 

$\mathrm{sMAPE} = \mathrm{mean}(200 \, |a_t - f_t| / (a_t + f_t))$    (2.11) 

Symmetric Median Absolute Percentage Error 

$\mathrm{sMdAPE} = \mathrm{median}(200 \, |a_t - f_t| / (a_t + f_t))$    (2.12) 
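
As a minimal sketch of the mean-based percentage measures above, the following Python 
functions compute MAPE and sMAPE. The function names and test values are hypothetical, 
and the percentage error is assumed to be scaled by 100 with the error defined as 
forecast minus actual. 

```python
from statistics import mean


def mape(forecasts, actuals):
    """Mean of absolute percentage errors; infinite if any actual demand is zero."""
    if any(a == 0 for a in actuals):
        return float("inf")
    return mean(abs(100 * (f - a) / a) for f, a in zip(forecasts, actuals))


def smape(forecasts, actuals):
    """Symmetric MAPE; undefined only when forecast and actual are both zero."""
    return mean(200 * abs(a - f) / (a + f) for f, a in zip(forecasts, actuals))


# Same absolute error of 50 units, once as over-forecast and once as under-forecast.
print(mape([150, 50], [100, 100]))
print(smape([150], [100]), smape([50], [100]))

# A zero actual demand: MAPE breaks down, sMAPE returns its upper bound.
print(mape([5], [0]), smape([5], [0]))
```

With an actual demand of 100, forecasts of 150 and 50 both give an absolute percentage 
error of 50, while the corresponding sMAPE terms are 40 and about 66.7. With zero 
actual demand, sMAPE returns 200 for any positive forecast. 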

However, while Hyndman and Koehler (2006) found that these metrics reduced 
the unwanted effects caused by small actual demand values, they did not completely 
solve the problem. Moreover, some studies showed that these metrics are not as 
symmetric as they were supposed to be (Goodwin and Lawton, 1999; Koehler, 2001). 

c. Relative Error Metrics 

These metrics are based on the division of an error produced by one forecast 
method by the error of another forecast method, which serves as a benchmark method. 
Often, the benchmark forecast method consists of just a replication of previous period 
values, which Hyndman and Koehler (2006) define as random walk. That procedure is 
also known in the literature as the naive method (Makridakis et al., 1993). Hence, 
relative error ($r_t$) is expressed by the following equation: 

$r_t = e_t / e_t^*$    (2.13) 

where $e_t^*$ is the error produced by the benchmark method at time t. 

The most common relative error measures are: 

Mean Relative Absolute Error 

$\mathrm{MRAE} = \mathrm{mean}(|r_t|)$    (2.14) 

Median Relative Absolute Error 

$\mathrm{MdRAE} = \mathrm{median}(|r_t|)$    (2.15) 

Geometric Mean Relative Absolute Error 

$\mathrm{GMRAE} = \mathrm{gmean}(|r_t|)$    (2.16) 

Scrutinizing the relative error equation, we found that it is inherently flawed: when 
the error produced by the benchmark method is zero, the relative error becomes infinite, 
and very small benchmark errors induce extremely high relative errors. 
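
The flaw can be reproduced with a short Python sketch that uses the naive method as the 
benchmark. The function names and demand values are hypothetical, and returning 
infinity for a zero benchmark error is a choice made here for illustration only. 

```python
from statistics import mean, median


def relative_errors(forecasts, actuals):
    """Relative errors against the naive benchmark, starting at the second period."""
    result = []
    for t in range(1, len(actuals)):
        error = forecasts[t] - actuals[t]
        benchmark_error = actuals[t - 1] - actuals[t]  # naive forecast is a_(t-1)
        if benchmark_error == 0:
            result.append(float("inf"))  # division by zero in the relative error
        else:
            result.append(error / benchmark_error)
    return result


def mrae(forecasts, actuals):
    return mean(abs(r) for r in relative_errors(forecasts, actuals))


def mdrae(forecasts, actuals):
    return median(abs(r) for r in relative_errors(forecasts, actuals))


actuals = [10, 12, 12, 15]      # demand repeats in period 3, so the naive error is zero
forecasts = [10, 11, 13, 14]
print(relative_errors(forecasts, actuals))
print(mrae(forecasts, actuals), mdrae(forecasts, actuals))
```

In this example a single repeated demand value makes MRAE infinite, while MdRAE still 
returns 0.5 because the median ignores the extreme value. 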

Regarding that issue, Armstrong and Collopy (1992) proposed a particular way to 
soften the mentioned effect by trimming results, the so-called Winsorizing. Basically, 
they assigned fixed values when benchmark errors are below or above certain thresholds. 
According to Hyndman and Koehler (2006), this procedure increases complexity and 
inserts arbitrariness. 

d. Relative Metrics 

Instead of simply dividing errors, these metrics are based on dividing the results of 
one accuracy metric, obtained from the errors produced by different forecast methods. 
For example, Relative Mean Absolute Error is the division of the MAE generated by one 
forecast method by the MAE generated by a second method. Following are some of the 
possible metrics: 

Relative Mean Absolute Error 

$\mathrm{RMAE} = \mathrm{MAE}_1 / \mathrm{MAE}_2$    (2.17) 

Relative Root of Mean Squared Error 

$\mathrm{RRMSE} = \mathrm{RMSE}_1 / \mathrm{RMSE}_2$    (2.18) 

Relative Median Absolute Error 

$\mathrm{RMdAE} = \mathrm{MdAE}_1 / \mathrm{MdAE}_2$    (2.19) 

Relative Mean Absolute Percentage Error 

$\mathrm{RMAPE} = \mathrm{MAPE}_1 / \mathrm{MAPE}_2$    (2.20) 

As the name of this group of metrics suggests, the results are given in relation to 
another forecast method. Hence, values from zero to one mean a better forecast, compared 
to the benchmark forecast method. When the result is one, there is no significant 
difference between the considered forecast methods. Results bigger than one mean that 
the forecast method used performed worse than the benchmark. Hyndman and Koehler (2006) 
consider the characteristic of easy interpretability as an advantage of these metrics. 
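
A minimal Python sketch of the Relative Mean Absolute Error follows. The function names 
and the data are hypothetical, and the benchmark is assumed to be the naive method, 
that is, the actual demand of the previous period. 

```python
def mae(forecasts, actuals):
    """Mean absolute error of one forecast method on one time series."""
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(actuals)


def relative_mae(method_forecasts, benchmark_forecasts, actuals):
    """MAE of the evaluated method divided by the MAE of the benchmark method."""
    benchmark_mae = mae(benchmark_forecasts, actuals)
    if benchmark_mae == 0:
        raise ZeroDivisionError("the benchmark forecast has no error")
    return mae(method_forecasts, actuals) / benchmark_mae


actuals = [12, 11, 13, 12]
naive = [10, 12, 11, 13]     # previous-period actuals used as benchmark forecasts
method = [11, 12, 12, 13]    # forecasts of the method under evaluation

rmae = relative_mae(method, naive, actuals)
print(round(rmae, 3))
```

A value below one, as in this example, indicates that the evaluated method produced a 
smaller MAE than the benchmark on the same series. 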

The only limitation found is that it is impossible to use these metrics across items, 
regarding just one period in time, since they use scale-dependent measures in the 
numerator and denominator that do not allow aggregation of different time series. 

Wheelwright et al. (1998) mention a specific relative metric, called Theil's U 
Statistic, and its variation, Theil's U-2 Statistic. Theil developed the first of those 
metrics in 1966, and it was modified into the second one in 1978. The article claims 
that Theil's U-2 statistic is just a particular case of RMAE, when the benchmark method 
is the naive method and forecasts are generated one period ahead. 

Another metric that uses the same principle of relative measures is Percent Better 
(PB). It is the percentage of times that one forecast method performs better than 
another, using any of the mentioned accuracy measures. Hyndman and Koehler (2006) 
mention two disadvantages of this metric. First, it is not sensitive to the size of 
errors and, second, it does not provide a clear idea of how much improvement is possible. 


e. Scaled Error Metric 

Hyndman and Koehler (2006) developed a new metric based on the principles of 
Relative Error Metrics and Relative Metrics. The rationale is to solve existing problems 
in the mentioned metrics by dealing with scaled errors ($q_t$). The scaling factor, the 
denominator of the scaled error, is the MAE of in-sample values of a benchmark forecast 
method. 

The scaled error is defined by the following equation: 

$q_t = \dfrac{e_t}{\frac{1}{k} \sum_{j=1}^{k} |a_j - f_j|}$    (2.21) 

where, 

j = sample time index 

k = time index of the last in-sample observation 

Hence, the error measured in a given time ($e_t$) is divided by the MAE of a 
benchmark forecast method, only considering the in-sample time period. 

Hyndman and Koehler (2006) propose a particular type of scaled error, in which 
the benchmark is the naive method. Because of that, the identity $f_j = a_{j-1}$ can be 
applied to adjust the equation. Moreover, they assume that the in-sample data comprises 
periods from 1 to k. That makes the difference $a_j - f_j$ applicable from period 2 to k, 
as the first possible $f_j$ value uses the $a_1$ value. As a result, there are k-1 
observations to be considered in the denominator of $q_t$. 

Applying the mentioned adjustments, the following equation results: 

$q_t = \dfrac{e_t}{\frac{1}{k-1} \sum_{j=2}^{k} |a_j - a_{j-1}|}$    (2.22) 


After that, the Mean of Absolute Scaled Error is simply given by: 

$\mathrm{MASE} = \mathrm{mean}(|q_t|)$    (2.23) 

The interpretation of results follows the same rules as described for relative 
measures. The only case in which the scaled error equations do not work is when all 
in-sample errors equal zero. We were also not able to find any negative critiques of 
this metric in the literature, so because of these factors we chose MASE as our metric 
of choice to compare against the accuracy metric proposed in the CIMIP. Additionally, in 
Chapter III, we present a further discussion on the importance of using benchmarks when 
measuring forecast accuracy. 
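
The calculation can be sketched in Python as follows, assuming the naive method as the 
benchmark and an in-sample history that precedes the evaluated periods. The function 
name, argument names and data are hypothetical. 

```python
def mase(forecasts, actuals, in_sample):
    """Mean absolute scaled error, scaled by the in-sample MAE of the naive method."""
    k = len(in_sample)
    if k < 2:
        raise ValueError("at least two in-sample observations are needed")
    # In-sample MAE of the naive forecast: k - 1 differences between periods 2 and k.
    scale = sum(abs(in_sample[j] - in_sample[j - 1]) for j in range(1, k)) / (k - 1)
    if scale == 0:
        raise ValueError("all in-sample naive errors are zero; MASE is undefined")
    scaled_errors = [(f - a) / scale for f, a in zip(forecasts, actuals)]
    return sum(abs(q) for q in scaled_errors) / len(scaled_errors)


history = [10, 12, 11, 13]                      # in-sample demand
value = mase(forecasts=[12, 14], actuals=[13, 12], in_sample=history)
print(round(value, 3))
```

Because every error is divided by the same in-sample scale, the result carries no unit 
and values below one indicate errors smaller than the average in-sample naive error. 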


3. Forecast Accuracy Metrics Currently Used in the Defense 
Environment 

As part of CIMIP implementation, the Office of the Secretary of Defense (OSD) 
established two metrics to measure forecast accuracy and forecast bias, while components 
already had their own ways to keep track of the goodness of their forecasts. This section 
aims to introduce the equations used by the DOD and the Navy, presenting brief comments 
about their main features. 


a. DOD’s Forecast Accuracy Metrics 

The challenge with a common metric that is self-reported is to ensure that each 
group is calculating the metric correctly. To address this issue, the DOD published 
internal business rules to standardize the reporting effort among the components (DOD, 
2013). As mentioned in Chapter I of our research, the CIMIP metric required specific 
data elements of the forecast and demand history, yet these business rules also detail what 
data should not be included. As stated in the introduction to the business rules document, 
the CIMIP “forecasting metrics are not the mechanism to reduce error; however the 
metrics will create a common baseline from which to measure the impact of other 
initiatives” (DOD, 2013, p. 2). The results of these forecast accuracy and bias metrics are 
to be reported semi-annually at the DOD’s inventory management reviews, as well as 
monitored by the CIMIP forecasting, total asset visibility, multi-echelon modeling 
working group and the supply chain metrics group (DOD, 2013). 

The components are responsible for collecting all of the data necessary to 
compute the metrics, which should include all items for which the components use some 
type of forecast algorithm. This excludes items whose requirements determination is 
impacted by component business rules, performance-based contracts and foreign military 
sales. The metric also excludes unforecastable items, which either do not have a demand 
forecast rate, or whose forecast and actual demand during the reporting period is equal to 
zero. Although the components are free to generate forecasts with the method and time 
horizon of their choosing, they are required to insert 12-month forecasts and actual 
demands in the calculations. 

The implementation of standard metrics to assess forecast accuracy and bias is 
one of the required actions, contained in CIMIP, to address the DOD need for better 
forecasts. From this point on, we are going to refer to those metrics as CIMIPf, the 
aggregated forecast accuracy obtained at a given period of time, and CIMIPb, the 
forecast bias, as follows: 

$\mathrm{CIMIP}_f = \left(1 - \dfrac{\sum_{i=1}^{n} c_i \, |f_i - a_i|}{\sum_{i=1}^{n} c_i \, a_i}\right) \times 100\%$    (2.24) 

$\mathrm{CIMIP}_b = \dfrac{\sum_{i=1}^{n} c_i \, (f_i - a_i)}{\sum_{i=1}^{n} c_i \, a_i} \times 100\%$    (2.25) 

where, 

n = number of items in the forecast dataset 

$c_i$ = unit cost for item i 

$f_i$ = demand forecast for item i 

$a_i$ = actual demand for item i 


A close look at the CIMIPf metric reveals a certain similarity to MAPE, Equation 
(2.7). The first notable difference is the one minus before the fraction. It implies the 
rationale that accuracy is better when error is small and does not represent any harm to 
the interpretation of results. Another important difference is that CIMIPf is a division 
of summations, instead of a summation of divisions. Additionally, we assume that CIMIPf, 
as an inventory forecast accuracy metric, uses unit costs to weight the importance of 
expensive items within the dataset and not as an evaluation of budget impacts. 

As mentioned in the introduction, the accuracy metrics contained in CIMIP are 
the central issue of this research. Therefore, careful discussion and evaluation of those 
characteristics are presented in Chapter III. 

b. Navy’s Forecast Accuracy Metric 

GAO criticized NAVSUP’s secondary inventory management and recommended 
that it “evaluate and improve demand forecasting procedures” (GAO, 2008, p. 5). Then, 
a NAVSUP team developed the Lead-time Adjusted Symmetric Error (LASE) as its 
demand forecast accuracy metric, more than a year prior to the release of the CIMIP 
forecast accuracy metrics (Bencomo, 2010). 

After determining that traditional accuracy measurements, such as MSE and 
MAPE, were insufficient, they combined two proposed solutions for calculating 
percentage error for intermittent demand: sMAPE and Denominator-Adjusted MAPE 
(DAM) (Hoover, 2006). 

The advertised benefits of the LASE metric were that it is capable of providing 
results with demand data that is highly intermittent, it does not generate a 
division-by-zero error, and it returns a symmetrical assessment of over- and 
under-forecasting. The equation follows: 

$\mathrm{LASE} = \dfrac{|f - a|}{[(f + a)/2] + 1}$    (2.26) 

Actually, the LASE equation is a combination of two aspects present in Hoover 
(2006). The first is that sMAPE, Equation (2.11), is a good way to measure forecast 
accuracy when forecast or actual demand is different from zero. The second is that when 
forecast and demand are both zero, managers should adjust the denominator by adding 
one. However, instead of applying the denominator adjustment only in cases in which 
forecast and actual demand are both zero, the LASE metric applies the adjustment as 
a general rule. This characteristic aims to ensure consistency, as opposed to the use of 
different criteria for different items. 


The following equation is a version of the LASE equation that is more consistent 
with the one proposed in Hoover (2006): 

$\mathrm{LASE}' = \dfrac{|f - a|}{I \, [(f + a)/2] + J}$    (2.27) 

where: 

if $f + a = 0$, then $I = 0$ and $J = 1$; 

if $f + a \neq 0$, then $I = 1$ and $J = 0$. 

However, we consider the complexity of LASE' as a drawback, as well as its lack 
of criteria consistency, as different items are subjected to different rules. 

One year after the metric was released, Jackson (2011) demonstrated that the 
LASE metric had an inherent smoothing effect that hampers the identification of large 
divergences between forecast methods. By the end of the study, he recommended against 
its use. Despite that, NAVSUP continues to utilize the LASE metric as an internal 
managerial tool to measure forecast accuracy. 


C. CHAPTER SUMMARY 

The forecast accuracy field of research has significantly evolved during the last 
sixty years, following the evolution of computing capabilities. Massive analyses and deep 
considerations in the literature provide relevant findings. From those, we highlight the 
following as the key learning points of this chapter: 

• There is no absolute best forecast method. 

• More complex forecast methods do not always improve accuracy. 

• Combining forecast methods will likely result in more accurate forecasts. 

• Forecast accuracy can be measured across two dimensions: the first is time 
and the second is line items. 

• Scale-dependent metrics do not allow aggregation of results. 

• Percentage error metrics are vulnerable to zero actual demand. 

• Relative error metrics and relative metrics are vulnerable to the occurrence 
of any zero error. 

• MASE avoids the flaws of many traditional metrics and remains in good 
standing among academic literature reviews. 

Separate from the evolutionary process of academic literature on forecast 
accuracy, the DOD and Navy developed their own forecast accuracy metrics, respectively 
CIMIPf and LASE, in an attempt to quantify and improve their forecasting efforts. 

III. ANALYSES ON CIMIP FORECAST ACCURACY METRIC 


A. INTRODUCTION 

This chapter will examine whether the current DOD forecast accuracy metric has 
any inherent flaws and if there are any alternative forecast accuracy metrics that avoid 
these flaws and produce higher quality, more relevant results. 

B. EVALUATION OF CURRENT METRIC 

At first glance, the CIMIPf metric, Equation (2.24), appears to be similar to 
MAPE, Equation (2.7), which, as we discussed in Chapter II, is a traditional forecast 
accuracy metric. The main difference between the two metrics is that MAPE is a 
summation of divisions, while CIMIPf is a division of summations that includes unit costs 
as a way to convert values to a common unit of measurement and prioritize the forecast 
performance of expensive items. 

While MAPE is a broadly studied, traditional metric, it contains specific flaws that 
limit the scope of its applicability. In this section, we will investigate whether those 
differences, along with other specific characteristics, make CIMIPf a recommendable 
managerial tool to assess forecast accuracy. 

1. Division of Summations 

One of the main objectives of any forecast accuracy metric that utilizes division 
of a numerator by a denominator is to avoid unit-of-measure dependence in order to 
enable aggregation of results across a range of products. CIMIPf, on the other hand, 
aggregates the results into dollars, by including unit costs in both the numerator and 
denominator before the division occurs. This division of the total forecast error in 
dollars by the total actual demand in dollars produces a scale-free, dollar-weighted 
result. 

To illustrate the methodological difference, we compare the CIMIPf metric to a 
cross-sectional extension of MAPE, in the manner that they determine their results. The 
equation for that variation of MAPE is: 

$\mathrm{MAPE}_f = \mathrm{mean}(|p_i|)$    (3.1) 

where: 

$p_i = 100 \, e_i / a_i$; and 

$e_i = f_i - a_i$ 

The MAPEf calculation first obtains the absolute percentage errors $|p_i|$ at the item 
level, then the individual results are averaged. CIMIPf first converts the numerator and 
denominator for each item into dollars, sums the numerators and denominators 
separately, and then divides one by the other to generate a forecast 
accuracy result that represents the entire population. In this example we have adjusted 
MAPE to the aggregated level to enable comparison, yet we could have adjusted CIMIPf 
to the individual level to accomplish the same. Later, Equation (3.2) will present this 
extension of CIMIPf. Table 2 and Table 3 provide an example of the methodological 
distinction. 


Table 2. MAPE Calculation 

Items    fi        ai     ei       |pi| 
1        23.84     32     -8.16    25.5% 
2        21.26     20     1.26     6.3% 
3        0         2      -2       100% 
4        235.42    151    84.42    55.9% 
                          MAPEf    46.93% 

The far right column shows how MAPE first calculates individual absolute percentage 
errors and then averages them to get the final value. 


Recalling the CIMIPf metric, Equation (2.24): 

$\mathrm{CIMIP}_f = \left(1 - \dfrac{\sum_{i=1}^{n} c_i \, |f_i - a_i|}{\sum_{i=1}^{n} c_i \, a_i}\right) \times 100\%$ 

Table 3. CIMIPf Calculation 

Items    fi       ai    ci               |fi-ai|   ci*ai             ci*|fi-ai| 
1        23.84    32    $1,354,173.00    8.16      $43,333,536.00    $11,050,051.68 
2        21.26    20    $43,125.00       1.26      $862,500.00       $54,337.50 
3        0        2     $32,815.00       2         $65,630.00        $65,630.00 
4        235.42   151   $260,000.00      84.42     $39,260,000.00    $21,949,200.00 
Sum                                                $83,521,666.00    $33,119,219.18 
CIMIPf                                                               60% 

The two far right columns of Table 3 demonstrate how CIMIPf sums the numerator (total 
dollar error) and denominator (total dollar demand) separately before dividing them, 
subtracting from one and then multiplying by 100 to generate the final CIMIPf value. 
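
The two calculations can be reproduced with a short Python sketch that uses the 
forecast, demand and unit cost values of the four items in Tables 2 and 3. The function 
names are hypothetical and the code is only an illustration of the order of 
operations, not an official implementation of either metric. 

```python
# (forecast f_i, actual demand a_i, unit cost c_i) for the items of Tables 2 and 3.
ITEMS = [
    (23.84, 32, 1_354_173.00),
    (21.26, 20, 43_125.00),
    (0.0, 2, 32_815.00),
    (235.42, 151, 260_000.00),
]


def mape_f(items):
    """Summation of divisions: average the item-level absolute percentage errors."""
    return sum(abs(f - a) / a for f, a, _ in items) / len(items) * 100


def cimip_f(items):
    """Division of summations: total dollar error over total dollar demand."""
    dollar_error = sum(c * abs(f - a) for f, a, c in items)
    dollar_demand = sum(c * a for _, a, c in items)
    return (1 - dollar_error / dollar_demand) * 100


print(f"MAPEf  = {mape_f(ITEMS):.2f}%")
print(f"CIMIPf = {cimip_f(ITEMS):.2f}%")
```

The first value matches the 46.93% of Table 2 and the second, once rounded to a whole 
percentage, matches the 60% of Table 3. 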


Moreover, as mentioned in Chapter II, MAPE's results at the item level do not 
generate a solution when actual demand is zero. This division by zero error negates the 
ability to generate an average result, unless those non-solutions are ignored, which then 
degrades the entire accuracy measurement. 

Meanwhile, CIMIPf metric avoids that effect by applying a summation in the 
denominator to account for the fact that the data can include items with zero demand. 
Thus, CIMIPf metric is able to produce valid results even when the data set contains 
values of zero for either the actual demand or forecast of individual line items. 

Therefore, we claim that the CIMIPf metric is more robust than MAPE. The only case 
in which the CIMIPf equation does not produce a valid result is when the actual demands 
of all items considered are zero. Table 4 aims to provide evidence of the superiority of 
CIMIPf, in terms of robustness, when compared to MAPEf. 




Table 4. Test of Relative Robustness of CIMIPf Compared to MAPEf 

Items    fi       ai    ci               |fi-ai|   ci*ai             ci*|fi-ai|        |fi-ai|/ai 
1        23.84    32    $1,354,173.00    8.16      $43,333,536.00    $11,050,051.68    25.5% 
2        21.26    0     $43,125.00       21.26     $0.00             $916,837.50       ∞ 
3        0        2     $32,815.00       2         $65,630.00        $65,630.00        100% 
4        235.42   151   $260,000.00      84.42     $39,260,000.00    $21,949,200.00    55.9% 
Sum                                                $82,659,166.00    $33,981,719.18 
CIMIPf                                                               59% 
MAPEf                                                                                  ∞ 

In this case, the actual demand of item 2 is zero, which invalidates the entire 
calculation of MAPE, while CIMIPf still produces a valid result. This supports the 
Hyndman and Koehler (2006) recommendation that MAPE should not be used in data sets that 
contain actual demands of zero. 


2. The Role of Unit Costs 

As mentioned, CIMIPf is calculated differently than the most traditional forecast 
accuracy metrics, as it implies that summations of forecast errors and actual demand 
values have to be made before the division, thus requiring the input data to be in the 
same unit-of-measure. In that context, unit costs are used as a means to standardize the 
units-of-measure of the items’ demand, allowing the summations to occur in both the 
numerator and denominator. 

In addition, the inclusion of unit cost also provides a weighting mechanism that 
prioritizes the accuracy of more expensive items over less expensive items. In the 
literature we reviewed, there is no mention of the use of weightings by the forecast 
accuracy metrics. All traditional equations are calculated around the forecast error, 
Equation (2.1), considering just two independent variables, forecast values and actual 
demands. The introduction of another independent variable such as unit cost, in the case 
of CIMIPf, may affect the results. While measuring forecast demand error in dollars is a 
workable metric, the stated goal of CIMIPf is to produce a percentage measure of forecast 
accuracy. 

Another point against the use of unit costs is that a secondary objective of the CIMIPf 
metric is to avoid excess inventory and the related costs. One can think that 
organizations must avoid excess inventory of high unit cost items to reduce unwanted 
financial impacts. However, total inventory cost is composed of holding, transportation, 
handling, acquisition and shortage costs. Of these five costs, only holding cost is 
directly affected by unit costs, and although positive correlations between unit costs 
and transportation, handling and shortage costs are possible, they are not certain. While 
cost is important to prioritize forecasting efforts, other factors such as criticality 
and interchangeability could also be considered. Acknowledging that unit cost is not the 
main driver for the total inventory cost or prioritization, we infer that forecast 
accuracy should be measured as a function of forecast and actual demand values. 

To determine the positive and negative effects of using unit cost in the equation, we 
need to test to what extent it can significantly affect the interpretation of forecast 
accuracy. To do this, we built a test composed of four data sets, Table 5 through 
Table 8, that keep forecast and demand values constant, while allowing the unit costs to 
vary: 



Table 5. Test of Cost Impact on CIMIPf - Data Set 1 

Items    fi    ai     ci           |fi-ai|   ci*ai          ci*|fi-ai| 
1        90    100    $1,000.00    10        $100,000.00    $10,000.00 
2        30    100    $50.00       70        $5,000.00      $3,500.00 
3        50    100    $20.00       50        $2,000.00      $1,000.00 
4        80    100    $250.00      20        $25,000.00     $5,000.00 
Sum                                          $132,000.00    $19,500.00 
CIMIPf                                                      85% 



Table 6. Test of Cost Impact on CIMIPf - Data Set 2 

Items    fi    ai     ci           |fi-ai|   ci*ai          ci*|fi-ai| 
1        90    100    $50.00       10        $5,000.00      $500.00 
2        30    100    $1,000.00    70        $100,000.00    $70,000.00 
3        50    100    $20.00       50        $2,000.00      $1,000.00 
4        80    100    $250.00      20        $25,000.00     $5,000.00 
Sum                                          $132,000.00    $76,500.00 
CIMIPf                                                      42% 

Table 7. Test of Cost Impact on CIMIPf - Data Set 3 

Items    fi    ai     ci           |fi-ai|   ci*ai          ci*|fi-ai| 
1        90    100    $20.00       10        $2,000.00      $200.00 
2        30    100    $50.00       70        $5,000.00      $3,500.00 
3        50    100    $1,000.00    50        $100,000.00    $50,000.00 
4        80    100    $250.00      20        $25,000.00     $5,000.00 
Sum                                          $132,000.00    $58,700.00 
CIMIPf                                                      56% 



Table 8. Test of Cost Impact on CIMIPf - Data Set 4 

Items    fi    ai     ci           |fi-ai|   ci*ai          ci*|fi-ai| 
1        90    100    $250.00      10        $25,000.00     $2,500.00 
2        30    100    $50.00       70        $5,000.00      $3,500.00 
3        50    100    $20.00       50        $2,000.00      $1,000.00 
4        80    100    $1,000.00    20        $100,000.00    $20,000.00 
Sum                                          $132,000.00    $27,000.00 
CIMIPf                                                      80% 


CIMIPf results ranged from 42% to 85%, which may lead to diverse interpretations of 
forecast accuracy. 


The results of this test demonstrate that the presence of unit cost in CIMIPf metric 
harms the quality of the item demand forecast accuracy measurement. 
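
The four data sets of the test can be replayed with the Python sketch below. The 
forecasts, demands and unit costs are those of Table 5 through Table 8; the function 
name is hypothetical, and the equal-cost run at the end is an additional assumption 
made here for comparison and is not part of the original test. 

```python
FORECASTS = [90, 30, 50, 80]
ACTUALS = [100, 100, 100, 100]

# Unit costs per item; only the position of the $1,000 item changes between sets.
COST_SETS = {
    "Data Set 1": [1000, 50, 20, 250],
    "Data Set 2": [50, 1000, 20, 250],
    "Data Set 3": [20, 50, 1000, 250],
    "Data Set 4": [250, 50, 20, 1000],
}


def cimip_f(forecasts, actuals, costs):
    """One minus total dollar error over total dollar demand, as a percentage."""
    dollar_error = sum(c * abs(f - a) for f, a, c in zip(forecasts, actuals, costs))
    dollar_demand = sum(c * a for a, c in zip(actuals, costs))
    return (1 - dollar_error / dollar_demand) * 100


results = {
    name: round(cimip_f(FORECASTS, ACTUALS, costs))
    for name, costs in COST_SETS.items()
}
equal_cost = cimip_f(FORECASTS, ACTUALS, [1, 1, 1, 1])  # assumption: no cost weighting

print(results)
print(equal_cost)
```

With identical forecasts and demands, the result moves between 42% and 85% depending 
only on which item carries the highest unit cost, while the equal-cost run gives 62.5%. 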

3. Production of Intuitive Results 

CIMIPf uses two features commonly found in percentage equations. It first applies 
the complementary concept of “one minus the fraction”, then it multiplies that fractional 
value by 100 to produce a percentage result. 

However, percentage equations are expected to produce values between zero and 
one, which does not occur in CIMIPf. The summation of errors, CIMIPf's numerator, can 
be higher than the summation of actual demands, CIMIPf's denominator. That condition 
causes the fraction to be bigger than one and the final number to be negative and 
unbounded, which we consider counter-intuitive. 

To demonstrate that, we built a test comprised of two hypothetical items, as 
follows: 

Table 9. Generation of Counter-Intuitive Results - Initial Data Set 

              fi    ai    ci    |fi-ai|   ci*ai    ci*|fi-ai| 
Test item     1     1     1     0         1        0 
Fixed item    1     1     1     0         1        0 
Sum                                       2        0 
CIMIPf                                             100% 

By allowing the forecast value of the test item to vary from one to 10, we 
obtained: 


Figure 6. Generation of Counter-Intuitive Results by CIMIPf 



Counter-intuitive, negative results are generated by CIMIPf in cases where the 
summation of errors is larger than the summation of actual demands. Considering the 
results at the item level, we infer that products with errors larger than actual demand may 
exert significant negative pressure on the aggregated CIMIPf result. 
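
The behavior can be reproduced numerically with the two items of Table 9. In the Python 
sketch below, the function and variable names are hypothetical; the test item's forecast 
varies from 1 to 10 while its actual demand, its unit cost and the fixed item all stay 
at one. 

```python
def cimip_f(items):
    """CIMIPf for a list of (forecast, actual demand, unit cost) tuples."""
    dollar_error = sum(c * abs(f - a) for f, a, c in items)
    dollar_demand = sum(c * a for _, a, c in items)
    return (1 - dollar_error / dollar_demand) * 100


FIXED_ITEM = (1, 1, 1)  # forecast, actual demand and unit cost of the fixed item

# CIMIPf for each forecast value of the test item, from 1 to 10.
curve = {f: cimip_f([(f, 1, 1), FIXED_ITEM]) for f in range(1, 11)}

for forecast, value in curve.items():
    print(forecast, f"{value:.0f}%")
```

The result starts at 100% for a forecast of 1, reaches 0% at a forecast of 3, where the 
total error equals the total demand, and keeps falling to -350% at a forecast of 10. 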

Furthermore, because forecasts cannot fall below zero, under-estimation errors cannot 
exceed actual demand, so all cases of forecast errors larger than actual demand only 
occur with over-estimations. That inherent characteristic of forecast errors helps all 
accuracy metrics to penalize the occurrence of extremely large over-estimations, which 
are closely related to the formation of excess inventory. 

4. Composition of Data Matters 

If the probability of occurrence of errors that are bigger than actual demand is 
assumed to change along with demand size, then the composition of data may affect 
CIMIPf results. One can intuitively assume that low-demand items are more likely to 
have errors bigger than their actual demands. Considering that, if a data set is primarily 
comprised of low-demand items, a poor, or even negative, CIMIPf result is to be 
expected. 

To validate the rationale that composition of data matters, we first need to test the 
assumption that errors bigger than demand are more frequent in low-demand items. 
According to FY15 data, among 44,675 NIINs, 24,309 (54.41%) had errors bigger than 
demand and they were distributed according to the following histogram: 

Figure 7. Histogram of Items with Errors Bigger than Demand in FY15 

[Figure: histogram; axis label: FY15 Demands] 

Vertical axis in exponential scale helps to picture the extreme skewness of the data. 

Second, we divided the data into low-demand and high-demand items, according 
to a quantile approach, to compare CIMIPf results. 

Table 10. CIMIPf Results on Low-Demand Versus High-Demand Items (FY15) 

               Dmd size   Qty of items   Dollar error         Dollar dmd           CIMIPf
Low-demand     0-1        28,235         $555,585,938.00      $125,993,049.00      -341%
High-demand    2-inf      15,690         $2,022,307,097.00    $4,863,140,807.00    58%
Aggregate      0-inf      43,925         $2,577,893,035.00    $4,989,133,856.00    48%

There is clear evidence that low-demand items can exert a negative pressure on the overall 
result. 

Additionally, Table 11 shows that CIMIPf results tend to be better as we only 
consider items with higher demand. The aggregate CIMIPf, 48%, disguises the fact that 
for high-demand items the dollar-error is relatively small, while for low, intermittent-
demand items, the dollar-error relative to the actual dollar-demand is very large. 


Table 11. Data Composition and CIMIPf Variation (FY15) 

Dmd size    Qty of items   Dollar error         Dollar dmd           CIMIPf
0-inf       43925          $2,577,893,034.49    $4,989,133,856.14    48.33%
100-inf     546            $294,122,953.00      $941,957,920.00      68.78%
500-inf     90             $22,186,297.00       $79,316,220.00       72.03%
1000-inf    49             $13,291,976.00       $48,555,715.00       72.63%


We partially attribute those increasing CIMIPf results to the fact that errors 
bigger than demand become less likely as demand increases. But, on top of that, there is 
the fact that items with higher demand usually display a pattern that facilitates the 
generation of accurate forecasts. 

Therefore, combining the results of the three tests conducted in this section, we infer 
that the composition of the data set, expressed as a ratio of high and low-demand items, 
can significantly affect CIMIPf results. The higher the ratio of low to high-demand items, 
the more likely the result will be a lower forecast accuracy measurement. 

C. COMPARATIVE ANALYSIS 

Considering the potential flaws of CIMIPf, mentioned above, a comparative 
analysis is necessary to allow a judgment about the existence of a better metric. After 
reviewing the existing literature, we selected an alternative metric and developed a 
framework to allow a fair comparison between the two metrics. 


1. Alternative Metric Selection 

As discussed in the literature review, MASE is intuitively expected to gather most 
of the desirable characteristics of a forecast accuracy measure, thus justifying its use as 
an alternate metric for comparison. Specifically, one of the main characteristics of MASE 
is the capacity to produce accuracy results at the item level, even when actual demand is 
zero, as well as at the aggregate level. Another important characteristic is that it enables a 
fair comparison among the services and DLA through its use of a benchmark method 
instead of generating absolute values. 

a. Further Discussion on Performance Benchmarking 

According to Dictionary.com, a benchmark is “any standard or reference 
by which others can be judged,” and using a benchmark to measure performance is a 
widespread practice. An additional definition of the word is “a standard of 
excellence, achievement, etc., against which similar things must be measured or judged” 
(Dictionary.com), and this idea of comparing similar things is key. Most people have 
heard a version of the phrase “comparing apples and oranges,” and it applies to many areas 
where comparisons are made between two or more things. In our research we have 
discussed how DOD intends to measure the forecasting performance of the military 
services and DLA by calculating how well each of them generated forecasts for the 
material that they manage. While this exercise in measurement and comparison is 
intended to complement the goals of the overall CIMIP, it does not mean that we are 
making a true “apples to apples” comparison. 

CIMIPf is simply computed by inserting forecasted demand, actual demand and 
unit cost for each item into the equation, which then produces one number. Although 
each service and DLA is engaged in managing secondary inventory, the material, 
quantity and demand patterns of this inventory are not the same. While they may appear 
similar and in some ways are, the fact is that they each face unique challenges in 
forecasting their demand and it is potentially misleading to directly compare their 
performance. To illustrate this point with something that all federal employees are 
familiar with, we will examine the use of benchmarks by the Thrift Savings Plan (TSP). 

On April 1, 1987, the TSP began operations with a single fund, known as the G 
Fund, which invested solely in government securities that were not available to the 
public. By 2001, the number of investment funds available in the TSP had grown to five 
with the inclusion of the fixed income F Fund, the common stock C Fund, the small 
capitalization stock S Fund and the international stock I Fund. Following common 
industry practice, since each of these four new funds was invested in securities available 
to the public, each fund’s performance is compared against a commercial index made up 
of similar assets. These commercial indexes act as performance benchmarks for the 
funds. Since the TSP funds are modeled after these commercial indexes, a strategy 
known as passive management, their performance does not vary much from the index. 
This common industry practice becomes more important with actively managed funds, 
where managers are attempting to outperform these commercial indexes. Table 12 
shows each TSP fund with its respective index or benchmark and Table 13 compares the 
performance of the TSP funds against their benchmark indexes. 


Table 12. TSP Fund and Benchmark Index. Adapted from Thrift Savings Plan (n.d.b). 

TSP Fund    Commercial Benchmark
G Fund      N/A
F Fund      Barclays Capital U.S. Aggregate Bond Index
C Fund      Standard & Poor's 500 Stock Index
S Fund      Dow Jones U.S. Completion TSM Index
I Fund      MSCI EAFE Stock Index


Table 13. TSP and Index Annual Returns 2011-2015. Source: Thrift Savings Plan (n.d.a). 

Year   G Fund   F Fund   U.S. Agg.    C Fund   S&P 500   S Fund   DJ U.S. Completion   I Fund    EAFE
                         Bond Index            Index              TSM Index                      Index
2011   2.45%    7.89%    7.84%        2.11%    2.11%     -3.38%   -3.76%               -11.81%   -12.14%
2012   1.47%    4.29%    4.22%        16.07%   16.00%    18.57%   17.89%               18.62%    17.32%
2013   1.89%    -1.68%   -2.03%       32.45%   32.39%    38.35%   38.05%               22.13%    22.78%
2014   2.31%    6.73%    5.97%        13.78%   13.69%    7.80%    7.63%                -5.27%    -4.90%
2015   2.04%    0.91%    0.55%        1.46%    1.38%     -2.92%   -3.42%               -0.51%    -0.81%


This table demonstrates how an individual TSP fund’s performance compares to a 
benchmark index, rather than a simple comparison to the other TSP funds. 


The comparison to these benchmark index funds enables managers and potential 
investors to better judge the effectiveness of the TSP fund managers in meeting their 
intended objective. For example, an S Fund investor should be satisfied with the 
management of his fund for all five years even though the C Fund had better returns in 
three of the five years. An apples-to-oranges comparison of the S and C Funds over these 
five years would conclude that the S Fund manager performed better in only two of the 
five years, while the C Fund manager performed better in three of the five years. An 
apples-to-apples comparison of these two fund managers would conclude that both of 
them matched or exceeded the performance of their benchmark index in all five years. 

b. DOD Forecasting Benchmarks 

The same principle of comparing investment fund performance to a relevant 
benchmark applies to the comparison of the services and DLA in their year-to-year 
forecasting performance. Concluding that one service forecasted better than another, 
based on a single CIMIPf metric result, ignores the fact that the lower-performing service 
may be managing material that is much more challenging to forecast than the 
higher-performing service. To date the DOD has resisted GAO recommendations to set standard 
forecasting performance goals, which could potentially result in apples-to-oranges 
comparisons. The DOD has stated that it wanted “to establish a baseline of performance 
on the metrics prior to setting any department-wide goals” (GAO, 2015b, p. 43), yet a 
department-wide goal, while simple, may not be as effective in measuring true forecast 
performance. An alternate method would be for each service to generate forecast 
accuracy metrics for a naive method forecast of their material and compare that with their 
actual performance. In keeping with our investment fund analogies, this method of 
evaluation is similar to how actively managed investment portfolios are compared against 
an index of similar assets. 

The calculation of a naive method simply requires the user to determine the level 
of demand for the preceding period and then assume that the demand will remain the 
same in the future period. 

In order to exemplify the function of the naive method as a benchmark, Table 14 
presents a set of three hypothetical items with different levels of demand variability, 
which is visualized in Figure 8, along with their accuracy results, measured by four 
different metrics. 

Table 14. Naive Method as a Benchmark 

Item 1 

Time   Demand   Naive   Err   Abs err   Sq err   APE
1      10                                        0%
2      9        10      -1    1         1        11%
3      11       9       2     2         4        18%
4      10       11      -1    1         1        10%
5      9        10      -1    1         1        11%
6      11       9       2     2         4        18%
7      11       11      0     0         0        0%
8      10       11      -1    1         1        10%
9      9        10      -1    1         1        11%
10     10       9       1     1         1        10%

Stdev  0.8164966    MAE    1.11
Avg    10           MSE    1.56
CV     0.0816497    CIMIP  90%
                    MAPE   10%

Item 2 

Time   Demand   Naive   Err   Abs err   Sq err   APE
1      10                                        0%
2      6        10      -4    4         16       67%
3      14       6       8     8         64       57%
4      10       14      -4    4         16       40%
5      6        10      -4    4         16       67%
6      14       6       8     8         64       57%
7      14       14      0     0         0        0%
8      10       14      -4    4         16       40%
9      6        10      -4    4         16       67%
10     10       6       4     4         16       40%

Stdev  3.2659863    MAE    4.44
Avg    10           MSE    24.89
CV     0.3265986    CIMIP  60%
                    MAPE   43%

Item 3 

Time   Demand   Naive   Err   Abs err   Sq err   APE
1      10                                        0%
2      3        10      -7    7         49       233%
3      17       3       14    14        196      82%
4      10       17      -7    7         49       70%
5      3        10      -7    7         49       233%
6      17       3       14    14        196      82%
7      17       17      0     0         0        0%
8      10       17      -7    7         49       70%
9      3        10      -7    7         49       233%
10     10       3       7     7         49       70%

Stdev  5.7154761    MAE    7.78
Avg    10           MSE    76.22
CV     0.5715476    CIMIP  30%
                    MAPE   107%


All four accuracy measures of the naive forecasts are best in item 1, which has the 
smallest coefficient of variability in the dataset. The opposite also holds, as the worst 
accuracy results in all metrics were obtained in the item that has the highest coefficient 
of variability. Since this analysis is at the item level, we applied CIMIPi*, Equation (3.3). 

Figure 8. Different Levels of Variability 

Items’ demands were designed to provide a clear understanding of the different levels 
of variability. 

According to the example, with the naive method, material with a lower level of 
variability generates relatively accurate forecasts, while material with a higher level of 
variability generates relatively poor forecasts. 

Aggregating the individual accuracy results across a large set of items should 
provide the user with a general idea of how difficult the population of material is to 
forecast. A large error signifies a difficult population, while a small error signifies a 
simple population. 
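
The relationship between demand variability and naive-forecast error in Table 14 can be reproduced with a short Python sketch. It is a minimal illustration, assuming the naive forecast for each period is the demand of the preceding period, that MAE and MSE are averaged over the nine forecasted periods, and that the standard deviation is the sample standard deviation; the names `ITEMS`, `naive_errors` and `naive_benchmark` are hypothetical.

```python
from statistics import mean, stdev

# Demand series of the three hypothetical items in Table 14.
ITEMS = {
    "Item 1": [10, 9, 11, 10, 9, 11, 11, 10, 9, 10],
    "Item 2": [10, 6, 14, 10, 6, 14, 14, 10, 6, 10],
    "Item 3": [10, 3, 17, 10, 3, 17, 17, 10, 3, 10],
}


def naive_errors(demand):
    """Naive method: the forecast for period t is the demand of period t-1."""
    return [actual - previous for previous, actual in zip(demand, demand[1:])]


def naive_benchmark(demand):
    """Variability of the series and error of its naive forecast."""
    errors = naive_errors(demand)
    return {
        "cv": stdev(demand) / mean(demand),       # coefficient of variation
        "mae": mean(abs(e) for e in errors),      # mean absolute error
        "mse": mean(e * e for e in errors),       # mean squared error
    }


if __name__ == "__main__":
    for name, demand in ITEMS.items():
        b = naive_benchmark(demand)
        print(f"{name}: CV={b['cv']:.4f}  MAE={b['mae']:.2f}  MSE={b['mse']:.2f}")
```

In this sketch the naive MAE and MSE rise together with the coefficient of variation, which is the property that makes the naive method usable as a measure of how hard a population is to forecast.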

In the same manner that investors expect their asset managers to provide value 
greater than a passively managed investment, so too should the DOD expect its material 
managers to generate forecasts that generally perform better than a naive method 
benchmark. Table 15 and Figure 9 demonstrate how utilizing a naive benchmark like 
this would give DOD leadership a better understanding of how well its components were 
actually forecasting. While the Navy is more interested in improving its own forecasting 
efforts, the DOD needs to be able to accurately assess the performance of all five 
reporting agencies. 







Table 15. Theoretical Forecast and Benchmark Performance 

Year      Army       Army Naive   Navy       Navy Naive   Air Force   Air Force   DLA        DLA Naive
          Forecast   Benchmark    Forecast   Benchmark    Forecast    Naive       Forecast   Benchmark
          Accuracy                Accuracy                Accuracy    Benchmark   Accuracy
2011      30%        40%          45%        40%          55%         60%         90%        85%
2012      40%        35%          55%        45%          60%         75%         80%        90%
2013      10%        30%          49%        42%          64%         55%         70%        80%
2014      32%        25%          50%        40%          62%         65%         75%        85%
2015      40%        30%          48%        45%          70%         70%         80%        90%
Average   30%        32%          49%        42%          62%         65%         79%        86%


Numbers are fictional. This table demonstrates how naive method benchmarks can bring 
forecast accuracy results into perspective, in a similar way that TSP fund performance is 
compared to a benchmark index. 
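
The comparison that Table 15 supports can be sketched as a simple gap between each component's forecast accuracy and its own naive benchmark. The snippet below uses the fictional numbers of Table 15; the names `TABLE_15` and `average_gap` are hypothetical, and the gap in percentage points is only one possible way to summarize the comparison.

```python
# Fictional values from Table 15, in percent:
# year -> (forecast accuracy, naive benchmark) for each component.
TABLE_15 = {
    "Army": {2011: (30, 40), 2012: (40, 35), 2013: (10, 30), 2014: (32, 25), 2015: (40, 30)},
    "Navy": {2011: (45, 40), 2012: (55, 45), 2013: (49, 42), 2014: (50, 40), 2015: (48, 45)},
    "Air Force": {2011: (55, 60), 2012: (60, 75), 2013: (64, 55), 2014: (62, 65), 2015: (70, 70)},
    "DLA": {2011: (90, 85), 2012: (80, 90), 2013: (70, 80), 2014: (75, 85), 2015: (80, 90)},
}


def average_gap(component):
    """Mean of (forecast accuracy - naive benchmark), in percentage points."""
    gaps = [accuracy - benchmark for accuracy, benchmark in TABLE_15[component].values()]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    for component in TABLE_15:
        print(f"{component:10s} average gap vs. naive benchmark: {average_gap(component):+.1f} points")
```

With these fictional numbers, DLA has the highest raw accuracy but falls below its own benchmark on average, while the Navy has a lower raw accuracy but exceeds its benchmark.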


Figure 9. Theoretical Chart Comparing Navy Versus DLA Forecasting Efforts 
(Numbers are Fictional) 

[Figure: bar chart; vertical axis from 0% to 100%; horizontal axis: 2011 to 2015; 
series: Navy Forecast Accuracy, Navy Naive Benchmark, DLA Forecast Accuracy, 
DLA Naive Benchmark] 

This figure intends to demonstrate that if a manager considered forecast accuracy in 
isolation, they would conclude that DLA was outperforming the Navy, but if the 
manager was provided with benchmarks, they might reach the opposite conclusion. 


2. Tests of Desirable Characteristics 

We selected four characteristics regarded as relevant to any reliable forecast 
accuracy metric, as follows: sensitivity to volume heterogeneity, symmetry on error 
treatment, robustness at individual and aggregated levels and allowance for a fair 
comparison. 

In order to provide a means of comparison between accuracy metrics, we 
designed particular tests for each one of the desirable characteristics. At the end of this 
section, we gathered the results in a judgment table to point out the best metric. 

a. Sensitivity to Volume Heterogeneity 

Assuming all items are of equal value, a pure aggregated forecast accuracy metric 
must give equal importance to each item. Otherwise, if any kind of weight is applied to 
specific items, results can be seriously distorted. Since the impact of unit cost variation in 
CIMIPf has already been tested in this research, we still need to test whether its results are 
potentially dominated by large forecasts and actual demands. It is obvious that different 
items contribute different amounts to the overall CIMIPf. But, since the item weight is 
composed of the demand volume and the unit cost, the degree to which high-volume 
items contribute disproportionately in any given dataset is an empirical question (again, 
assuming equal proportionality is what is desired). In this section, we test the relative 
sensitivity of CIMIPf and MASE to volume heterogeneity across inventory items. 

We built a test, comprised of two fictional datasets per accuracy metric, to check 
the possibility of the generation of type I errors, saying the forecast is accurate when it is 
actually inaccurate, and type II errors, saying the forecast is inaccurate when it is actually 
accurate. 

The first data set was designed to reflect a situation in which the forecast value is 
very close to the actual demand in one high-volume item, but the forecast model 
performs poorly in nine other low-volume items. In that situation, we should expect the 
CIMIPf result to indicate that the aggregated accuracy is low, and thus that the forecast 
method is performing poorly. Otherwise, a type I error arises. 

Table 16. High Volume and Type I Errors - CIMIPf 

Items    fi     ai      ci          |fi-ai|   ci*ai             ci*|fi-ai|
1        9000   10000   $1,000.00   1000      $10,000,000.00    $1,000,000.00
2        5      10      $1,000.00   5         $10,000.00        $5,000.00
3        5      10      $1,000.00   5         $10,000.00        $5,000.00
4        5      10      $1,000.00   5         $10,000.00        $5,000.00
5        5      10      $1,000.00   5         $10,000.00        $5,000.00
6        5      10      $1,000.00   5         $10,000.00        $5,000.00
7        5      10      $1,000.00   5         $10,000.00        $5,000.00
8        5      10      $1,000.00   5         $10,000.00        $5,000.00
9        5      10      $1,000.00   5         $10,000.00        $5,000.00
10       5      10      $1,000.00   5         $10,000.00        $5,000.00
Sum                                           $10,090,000.00    $1,045,000.00
CIMIPf                                                          89.64%


Since there is currently no DOD threshold for what constitutes an accurate 
forecast, we assume CIMIPf > 80% to classify the forecast as accurate. The result of this 
data set is not aligned with the initial expectation of poor performance. Therefore, we state 
that the result led to a type I error. 
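
The dominance of the high-volume item in this data set can be made explicit with a short Python sketch. It assumes CIMIPf is one minus the ratio of total dollar error to total dollar demand, in percent, and uses the 80% threshold assumed above; the names `cimip_f`, `HIGH_VOLUME` and `LOW_VOLUME` are hypothetical.

```python
def cimip_f(items):
    """Aggregate CIMIPf in percent; items are (forecast, actual, unit_cost) tuples."""
    items = list(items)
    dollar_error = sum(c * abs(f - a) for f, a, c in items)
    dollar_demand = sum(c * a for f, a, c in items)
    return (1 - dollar_error / dollar_demand) * 100


# Table 16 data: one well-forecast high-volume item, nine poorly forecast low-volume items.
HIGH_VOLUME = [(9000, 10000, 1000.0)]
LOW_VOLUME = [(5, 10, 1000.0)] * 9

THRESHOLD = 80.0  # assumed accuracy threshold, in percent

if __name__ == "__main__":
    overall = cimip_f(HIGH_VOLUME + LOW_VOLUME)
    low_only = cimip_f(LOW_VOLUME)
    print(f"All ten items:         {overall:.2f}%  accurate={overall > THRESHOLD}")
    print(f"Nine low-volume items: {low_only:.2f}%  accurate={low_only > THRESHOLD}")
```

Computed on the nine low-volume items alone, the score is well below the threshold, yet the single high-volume item is enough to lift the aggregate above it.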

The second data set aims to represent the opposite situation. A high-volume item 
has a poor forecast, while nine low-volume items have good-quality forecasts. In that 
situation, we should expect the CIMIPf result to indicate good forecast accuracy. 
Otherwise, a type II error is considered to occur. 

Table 17. High Volume and Type II Errors - CIMIPf 

Items    fi     ai      ci          |fi-ai|   ci*ai             ci*|fi-ai|
1        5000   10000   $1,000.00   5000      $10,000,000.00    $5,000,000.00
2        9      10      $1,000.00   1         $10,000.00        $1,000.00
3        9      10      $1,000.00   1         $10,000.00        $1,000.00
4        9      10      $1,000.00   1         $10,000.00        $1,000.00
5        9      10      $1,000.00   1         $10,000.00        $1,000.00
6        9      10      $1,000.00   1         $10,000.00        $1,000.00
7        9      10      $1,000.00   1         $10,000.00        $1,000.00
8        9      10      $1,000.00   1         $10,000.00        $1,000.00
9        9      10      $1,000.00   1         $10,000.00        $1,000.00
10       9      10      $1,000.00   1         $10,000.00        $1,000.00
Sum                                           $10,090,000.00    $5,009,000.00
CIMIPf                                                          50.36%


Using the same threshold of CIMIPf > 80% to classify an accurate forecast, the 
result of this data set is also not aligned with the initial expectation of good performance. 
Therefore, we state that the result led to a type II error. 

On the other hand, the MASE metric requires a slightly different type of data to be 
calculated. Hence, we created a very similar test, comprised of two other data sets that 
reflect the same situations used to uncover the dominance of high-volume items in the 
CIMIPf equation. Likewise, the same error-type definitions hold true. 

The first test, again, is the case in which nine low-volume items have relatively 
high forecast errors, while one high-volume item has a relatively low forecast error. In 
that arrangement, we should expect the result to indicate poor performance. Otherwise, we 
will consider that a type I error exists. 

Table 18. High Volume and Type I Errors - MASE 

Items   fi     ai      MAE of in-sample naive   et     qt
1       9000   10000   2500.00                  1000   0.40
2       5      10      2.50                     5      2.00
3       5      10      2.50                     5      2.00
4       5      10      2.50                     5      2.00
5       5      10      2.50                     5      2.00
6       5      10      2.50                     5      2.00
7       5      10      2.50                     5      2.00
8       5      10      2.50                     5      2.00
9       5      10      2.50                     5      2.00
10      5      10      2.50                     5      2.00
MASE                                                   1.84


Assuming a threshold of MASE < 0.8 to classify an accurate forecast, which is 
undoubtedly better than a naive forecast, the result aligns with the initial expectation. 
Therefore, there is no evidence of a type I error. 

The second test covers the opposite situation, as nine low-volume items have 
good-quality forecasts and one high-volume item has a poor forecast. We should 
expect a good accuracy result. 

Table 19. Large Numbers and Type I Errors in MASE 

Items   fi     ai      MAE of in-sample naive   et     qt
1       5000   10000   2500.00                  5000   2.00
2       9      10      2.50                     1      0.40
3       9      10      2.50                     1      0.40
4       9      10      2.50                     1      0.40
5       9      10      2.50                     1      0.40
6       9      10      2.50                     1      0.40
7       9      10      2.50                     1      0.40
8       9      10      2.50                     1      0.40
9       9      10      2.50                     1      0.40
10      9      10      2.50                     1      0.40
MASE                                                   0.56


Assuming the same threshold of MASE < 0.8 to classify an accurate forecast, the 
result aligns with the initial expectation. Therefore, we find no evidence of a type II error. 
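
The two MASE tests can be reproduced with a short Python sketch. It assumes, consistent with Tables 18 and 19, that each scaled error qt is the absolute forecast error divided by the item's in-sample naive MAE and that MASE is the plain average of the scaled errors across items; the names `mase`, `TYPE_I_SET` and `TYPE_II_SET` are hypothetical.

```python
def mase(items):
    """Mean of scaled errors; items are (forecast, actual, in_sample_naive_mae) tuples."""
    scaled = [abs(actual - forecast) / naive_mae for forecast, actual, naive_mae in items]
    return sum(scaled) / len(scaled)


# Table 18: good high-volume forecast, nine poor low-volume forecasts.
TYPE_I_SET = [(9000, 10000, 2500.0)] + [(5, 10, 2.5)] * 9
# Table 19: poor high-volume forecast, nine good low-volume forecasts.
TYPE_II_SET = [(5000, 10000, 2500.0)] + [(9, 10, 2.5)] * 9

THRESHOLD = 0.8  # assumed: MASE below this value classifies the forecast as accurate

if __name__ == "__main__":
    for label, data in (("Table 18", TYPE_I_SET), ("Table 19", TYPE_II_SET)):
        score = mase(data)
        print(f"{label}: MASE={score:.2f}  accurate={score < THRESHOLD}")
```

Because every item's error is scaled by its own naive error before averaging, each of the ten items carries the same weight in the result regardless of its demand volume.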

Considering the results of all four tests, it appears that CIMIPf is more sensitive to 
volume heterogeneity than MASE, in that its result is dominated by the high-volume 
item, and hence it is more likely to produce misleading results. 

b. Symmetry on Error Treatment 

As mentioned before, forecast errors in inventory demand data are bounded on the 
negative side, as a result of underestimations, and unbounded on the positive side, as a 
result of overestimations. However, forecast methods are expected to generate reasonable 
errors for the majority of items. Hence, we designed this test to verify whether equivalent 
variations of actual demand values, within a moderate range, to the positive and negative 
sides, can result in different impacts on CIMIPf than on MASE. Table 20 shows the initial 
arrangement of the test. 

Table 20. Initial Dataset to Test Error Side Equality - CIMIPf 

Items    fi    ai    ci    |fi-ai|   ci*ai        ci*|fi-ai|
1        100   100   100   0         $10,000.00   $-
2        100   100   100   0         $10,000.00   $-
3        100   100   100   0         $10,000.00   $-
4        100   100   100   0         $10,000.00   $-
Sum                                  $40,000.00   $-
CIMIPf                                            100%



Decision variables (uniform distributions): 

Decision Variable   Minimum   Maximum
A1                  0.00      50.00
A2                  50.00     100.00
A3                  100.00    150.00
A4                  150.00    200.00

Considering that the ranges of variation were designed to cause an equal 
proportion of positive and negative errors, an intuitive result would be that items with 
bigger errors on both sides contribute the most to CIMIPf variations. 
However, according to Figure 10, overestimations seem to impose a heavier 
pressure on CIMIPf results than underestimations do. 
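
A rough Monte Carlo sketch of this test is given below. It does not reproduce the simulation tool or the sensitivity measure behind Figure 10; it simply assumes the Table 20 setup (forecast of 100 and unit cost of 100 for every item), draws each actual demand from the uniform ranges listed above, and uses the Pearson correlation between each demand and the resulting CIMIPf as a stand-in for sensitivity. All names (`simulate`, `pearson`, `RANGES`) are hypothetical.

```python
import random

FORECAST = 100.0
UNIT_COST = 100.0
# Uniform ranges of actual demand for the four decision variables.
RANGES = {"A1": (0, 50), "A2": (50, 100), "A3": (100, 150), "A4": (150, 200)}


def cimip_f(actuals):
    """Aggregate CIMIPf in percent for a fixed forecast and unit cost."""
    actuals = list(actuals)
    dollar_error = sum(UNIT_COST * abs(FORECAST - a) for a in actuals)
    dollar_demand = sum(UNIT_COST * a for a in actuals)
    return (1 - dollar_error / dollar_demand) * 100


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(trials=5000, seed=1):
    """Correlation between each item's actual demand and the aggregate CIMIPf."""
    rng = random.Random(seed)
    draws = {name: [] for name in RANGES}
    scores = []
    for _ in range(trials):
        sample = {name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}
        for name, value in sample.items():
            draws[name].append(value)
        scores.append(cimip_f(sample.values()))
    return {name: pearson(values, scores) for name, values in draws.items()}


if __name__ == "__main__":
    for name, corr in simulate().items():
        print(f"{name}: correlation with CIMIPf = {corr:+.2f}")
```

In this sketch, A1 and A2 are the overestimated items (actual demand below the forecast). Raising their demand both shrinks the error and enlarges the denominator, whereas for A3 and A4 the two effects partly offset each other, which is one way to read the asymmetry described above.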


Figure 10. Sensitivity Chart of CIMIPf Equal Treatment Test 



The equivalent test applied on MASE is shown in Table 21. 

Table 21. Initial Dataset to Test Error Side Equality - MASE 

              Item 1                    Item 3
              FY13   FY14   FY15        FY13   FY14   FY15
ft            100    100    100         100    100    100
at            100    100    100         100    100    100
n             50     100    100         50     100    100
|at-at-1|     50     0      0           50     0      0
et            -      -      0           -      -      0
qt            0                         0

              Item 2                    Item 4
              FY13   FY14   FY15        FY13   FY14   FY15
ft            100    100    100         100    100    100
at            100    100    100         100    100    100
n             50     100    100         50     100    100
|at-at-1|     50     0      0           50     0      0
et            -      -      0           -      -      0
qt            0                         0

MASE          0


Decision variables (uniform distributions): 

Decision Variable   Minimum   Maximum
A1                  0.00      50.00
A2                  50.00     100.00
A3                  100.00    150.00
A4                  150.00    200.00

Unlike what happened with CIMIPf, the sensitivity chart in Figure 11 shows 
that MASE gives balanced importance to errors on both sides. 


Figure 11. Sensitivity Chart of MASE 

[Figure: sensitivity chart titled "Sensitivity: MASE"; horizontal axis from -300% to 200%] 



c. Robustness at Individual and Aggregate Levels 

Acknowledging the fact that no forecast method is expected to perform well in all 
situations, we agree with Fildes (1989) in stating that individual-level analysis is more 
powerful for managers, as it enables them to locate the origins of inaccuracy. 

CIMIPf was initially designed and has been used to calculate an aggregate number 
that represents the overall forecast performance of each service. To do so, WSS has used 
twelve-month windows of data to allow calculations of total dollar-errors and total 
dollar-demands, the two key components of the CIMIPf equation. As mentioned before in this 
chapter, CIMIPf is considered a robust metric at the aggregated level. 

However, when the intent is to produce accuracy measures at the item level, WSS 
managers take out the summation signs and the unit cost from the original CIMIPf 
formula (E. Liskow, personal communication, April 4, 2016), resulting in the following: 


$$CIMIP_i = 1 - \frac{|f_i - a_i|}{a_i} \tag{3.2}$$


We infer that this equation suffers from the same vulnerability as MAPE, 
Equation (2.7), which is returning an undefined (infinite) value when actual demand is 
zero. In the specific case of the Navy’s demand data, the occurrence of zero demands is 
highly likely, as mentioned before. 

In this research, we consider robustness as the ability to produce valid results, not 
undefined, in the majority of situations, which is in accordance with Baker et al. (2006). 
Therefore, as CIMIPi returns invalid values for a significant number of items in the Navy’s 
dataset, the metric is classified as not robust. 

However, a different approach is possible to improve the robustness of the CIMIPi 
equation. Rather than taking the summation sign out, the Navy could sum the forecast 
errors of one item through time. Unit cost is constant at the item level and is present in 
both summations of the fraction. Hence, it can be factored out and cancels. After 
applying those adjustments, the proposed equation should be: 

$$CIMIP_i^{*} = 1 - \frac{\sum_{t=1}^{n} |f_t - a_t|}{\sum_{t=1}^{n} a_t} \tag{3.3}$$

That equation is only vulnerable to the specific case of all actual demands being 
zero during the time considered. Therefore, as the time window increases, the probability 
of a zero value in the denominator is expected to decrease. Just as an example of the gain in 
robustness that this variation of the metric represents, when applied to a five-year, 
quarterly demand dataset, CIMIPi* was able to return 100% of valid results, in contrast to 
only 52% of valid results for CIMIPi when applied to the FY15 demand dataset. 
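
The difference in robustness between the two item-level forms can be sketched in Python. The functions below follow Equations (3.2) and (3.3) as described in the text and return `None` where the result is undefined; the function names and the intermittent-demand example are hypothetical and serve only as illustration.

```python
def cimip_item(forecast, actual):
    """Equation (3.2): single-period item-level score; None when actual demand is zero."""
    if actual == 0:
        return None
    return 1 - abs(forecast - actual) / actual


def cimip_item_star(forecasts, actuals):
    """Equation (3.3): errors and demands of one item summed through time.

    Returns None only when every actual demand in the window is zero.
    """
    total_demand = sum(actuals)
    if total_demand == 0:
        return None
    total_error = sum(abs(f - a) for f, a in zip(forecasts, actuals))
    return 1 - total_error / total_demand


if __name__ == "__main__":
    # Hypothetical intermittent item: four periods, two of them with zero demand.
    forecasts = [1, 1, 1, 1]
    actuals = [0, 2, 0, 1]
    per_period = [cimip_item(f, a) for f, a in zip(forecasts, actuals)]
    print("Per-period CIMIPi:", per_period)
    print("CIMIPi* over the window:", cimip_item_star(forecasts, actuals))
```

In this example the per-period form is undefined in half of the periods, while the time-summed form still returns a value because at least one period in the window has non-zero demand.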

On the other hand, the MASE metric was originally designed to be used in both 
dimensions of measurement, through time and across items, as used by 
Hyndman and Koehler (2006). Moreover, its denominator vulnerability is related to the 
occurrence of all-zero naive forecast errors, instead of all-zero actual demands as in 
CIMIPi*, which is even less likely to happen. 

Therefore, we can state that the MASE metric is potentially more robust than CIMIPi*, 
although its gains are not perceived in the data considered, as the latter could generate 
100% of valid results. 

d. Allowance for Fair Comparison 

Forecast accuracy values are often used as a means of performance comparison. 
In that context, it is very important to set the ground for a fair comparison to occur. 
Non-relative metrics do not account for the fact that different datasets may comprise diverse 
amounts of variability that create different levels of predictability and make the 
comparison in absolute numbers unfair. Therefore, comparisons of CIMIPf results at the 
aggregated and individual levels tend to be harmed by different levels of demand 
predictability in each dataset. MASE, conversely, uses the naive method as a benchmark to 
account for the level of demand predictability. 

Table 22 helps to explain the difference in the interpretation of results. 

Table 22. Difficulty to Forecast Test 

                    CV          CIMIPi*    MASE
More Predictable    0.125494    92.23%     0.49
Less Predictable    1.937644    -0.001%    0.57

Values were calculated using data from two real items, picked as representatives of high 
and low coefficients of variation. 

Considering a threshold of CIMIPi* > 80% to classify an accurate forecast, only 
the forecasts of the “more predictable” item qualify. To remain consistent, we applied a 
threshold of MASE < 0.80 to classify an accurate forecast. By doing so, the forecasts of 
both items meet the requirement. 

Based on this example, we see that if the forecast metric is to be used to compare 
the accuracy of item forecasts (to compare IMs, for example), MASE may do a better job 
controlling for the underlying variability of the data, and present a better picture of the 
relative performance on each item (or by each IM). Of course, this is a simplification. 
MASE controls for only one source of variation: single-period autocorrelation. Still, the 
point is that not all datasets are equally predictable, and caution should be used when 
comparing the accuracy of organizations managing different populations of material. 

Extrapolating this result to the aggregated level, we can assume a hypothetical 
scenario of two datasets where one is mostly comprised of more predictable items and the 
other is mostly comprised of less predictable items. When measuring accuracy in 
absolute numbers, the results of the second dataset will more likely be worse than the 
first. Alternatively, MASE benchmarks performance against the naive method, which 
enables the less predictable dataset to generate a relatively better result than the more 
predictable dataset. 

D. CHAPTER SUMMARY 

The main objectives of this chapter were to uncover evidence of inherent flaws 
in the CIMIPf metric, through the application of specific tests, as well as to draw a 
comparison to an alternative metric found in the literature. 

The key lessons of the CIMIPf metric evaluation were: 

• Type I and Type II errors are expected to occur; 

• It can generate counter-intuitive (e.g., negative) results; 

• The composition of the data set (e.g., level of variability) influences its 
results. 

Additionally, Table 23 aims to summarize the results of the tests contained in 
the comparative analysis. 

Table 23. Ranked Comparison of MASE and CIMIPf 

Desirable Characteristics                        MASE   CIMIPf
Dominance of high-volume                         1      2
Error side equality                              1      2
Robustness at aggregate and individual levels    1      1*
Allow for comparability between items            1      2

This table ranks each desirable characteristic. * Grade attributed in case CIMIPi* is used. 

In addition to demonstrating the theoretical problems with CIMIPf, we compared 
it to another metric that has been highly recommended in the literature. Our comparison 
was based on the numerical analysis of a set of generated examples, which are not 
representative, so the generalization of the findings is problematic. Based on our test set, 
it appears that CIMIPf performs poorly relative to MASE. 


IV. ANALYSES ON FORECAST PROCEDURES 


A. INTRODUCTION 

This chapter presents the calculations involved in the generation of a flexible 
forecast model rather than applying a fixed forecast method as a solution that fits all the 
items. The model uses a pool of forecast methods and forecast accuracy metrics, applied 
at the item level, as a means to optimize the selection of the forecast method to mitigate 
the expected error in forecasting. 

B. BACKGROUND ON CURRENT NAVY’S FORECASTING PROCESS 

NAVSUP is tasked with managing over 350,000 line items (E. Liskow, personal 
communication, April 4, 2016) as they progress through six LCI categories. LCIs 1 and 
2 cover the period from initial operational capability to the material support date, when 
there is little to no historical demand data, while LCI 3 occurs during the demand 
development interval. LCIs 4 and 5 cover the periods when the weapon system program 
is mature and has been identified for retirement, while LCI 6 covers the period after the 
official retirement. The way the Navy forecasts demand differs across 
these LCIs, yet in this paper we focus only on the forecasting procedures for LCIs 
4 and 5. Currently, LCI 4 consists of approximately 284,000 line items and LCI 5 
consists of approximately 23,000 line items (E. Liskow, personal communication, April 
4, 2016); yet only about 40,000 of these line items generate actual demand in a given 
year and meet the CIMIP definition of a forecastable item. The Navy utilizes a 
customized Enterprise Resource Planning (ERP) program to generate forecasts for all 
LCI 4 and 5 line items, yet not all of these forecasts will factor into the CIMIP forecast 
accuracy metrics. 

In a broad sense, the forecasting process begins by segregating the global 
wholesale demand for the previous five years into 20 quarterly buckets. It is important to 
note that this wholesale demand is not the retail, or end unit, demand, but rather the 
replenishment purchases made by the purchasing agents at the wholesale level. With 
these 20 quarters of historical demand calculated for all LCI 4 and 5 line items, ERP runs 
an exponential smoothing with backcasting algorithm, utilizing a smoothing factor, or 
alpha (α), equal to 0.2. From these calculations ERP generates a constant quarterly 
forecast for the next five years. Since the forecasted demand is constant, it is sufficient to 
multiply one quarter by four to generate the annual forecasts for the next five years. This 
forecasting process is repeated every quarter in an attempt to capture demand changes in 
the items with higher variability. The forecasts generated by ERP are also subject to 
review by their IM who has the option to modify them as they deem appropriate. Upon 
completion of the IM review, the demand forecast is finalized and published for use in 
purchasing and other material management decisions. 

The Office of the Secretary of Defense for Supply Chain Integration requires that 
each component report their forecast metrics semi-annually at the inventory management 
review. In April and October NAVSUP generates the Navy’s official CIMIP accuracy 
and bias metrics by comparing the original forecast for the preceding 12 months with the 
actual demand during that period. Since the beginning of CIMIP metric reporting in 
FY13, NAVSUP has made attempts to improve their forecasting results by correcting 
erroneous data and identifying the specific line items with the most significant 
forecasting errors (E. Liskow, personal communication, April 4, 2016). While current 
capabilities have made it necessary to utilize a one-size-fits-all forecasting model, in the 
future they plan to enhance their ability to generate tailored forecasts for those items 
for which the one-size-fits-all forecasting method produces inferior results (E. Liskow, 
personal communication, April 4, 2016). 

C. OBJECTIVE OF THE MODEL 

The mathematical model applied in this chapter aims to fill the existing gap 
between the current forecast process that uses a fixed method with fixed parameters and 
the desired stage of a tailored solution. The limitation of the model is that we arbitrarily 
chose the parameters to initiate the calculations, instead of using computational tools to 
optimize the choice. 

As mentioned before in this research, DOD requests the generation of an accuracy 
number that is capable of representing the overall performance of the components in 
forecasting the items’ demand in a given fiscal year. Those measures, combined with a 
certain threshold, aim to induce improvements in the components’ forecasting 
processes, which is expected to help in the effort of reducing the excess inventory. 

Additionally, we consider that the Navy’s forecasters can benefit from the 
accuracy measures to improve their work. The idea is to use those measures as a means 
to identify relevant deviations and to help in deciding on the most effective way to 
generate the forecasts. Hence, from the perspective of forecasters, the information needed 
is slightly different. Rather than generating a number that represents the overall ability to 
produce accurate forecasts in a given period, a new approach should be the measurement 
of an item’s accuracy over time. 

We also acknowledge the fact that there is no absolute best forecast method 
capable of generating the most accurate values for each one of the line items. Therefore, we 
designed a test to determine whether there are particular patterns of demand in which 
specific forecast methods tend to outperform the others. Moreover, we intend to present 
an aid for decision making, when a forecaster is dealing with an extensive and 
heterogeneous set of items’ demands. 

D. MODEL DESIGN 

In order to generate the required information, we built a flexible forecast model, 
which selects each individual item, generates forecast values using a pool of forecast 
methods and measures accuracy in a particular way to identify the forecast method that 
mitigates the forecast error. Once the whole data set is trimmed, a cycle of events takes place 
in order to generate the intended information. Figure 12 shows the sequence of tasks 
involved in the model. 

Figure 12. Model’s Flow Chart 

[Flow chart. Recoverable task labels: Separate Fit and Test Periods; Generate aggregated 
information; Analyze results.] 

The following sections will describe the relevant tasks of the model. 


1. Trim the Data 

The original data set used to initiate the model comprises five years of past 
demand of 80,427 NIINs. Demand data is grouped into 20 bins, each one representing a 
quarter of a fiscal year. In order to allow the calculations of six different forecast methods 
and four different accuracy metrics, the items that did not meet the minimum 
requirements were withdrawn. 

One limitation of the model used in this analysis is that one of the forecast 
methods and one of the accuracy metrics are not able to generate valid results in all 
situations. In order to avoid invalid results, considered as infinite, the data set has to be 
trimmed to comprise only items that fulfill two conditions: variable demand in the first 
four periods and at least one demand greater than zero in the last eight periods. 
Applying those conditions, 30,472 items remained out of a total dataset of 80,427 items. 
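
The two trimming conditions can be sketched as a simple filter. This is a minimal illustration, not the model's actual implementation: the function name `keep_item` is hypothetical, and "variable demand in the first four periods" is assumed to mean that the first four quarterly values are not all identical.

```python
def keep_item(demand):
    """Return True if a 20-quarter demand series passes both trimming conditions.

    Assumption: "variable demand" means the first four values are not all equal.
    """
    first_four = demand[:4]
    last_eight = demand[-8:]
    varies_early = len(set(first_four)) > 1        # not constant in periods 1-4
    active_late = any(d > 0 for d in last_eight)   # some positive demand in periods 13-20
    return varies_early and active_late


# Hypothetical items keyed by an illustrative identifier.
items = {
    "item_a": [1, 2, 1, 2] + [0] * 8 + [0, 0, 0, 5, 0, 0, 0, 0],
    "item_b": [3, 3, 3, 3] + [1] * 16,   # constant early demand, withdrawn
    "item_c": [1, 2, 3, 4] + [0] * 16,   # no demand in the last eight periods, withdrawn
}
trimmed = {name: series for name, series in items.items() if keep_item(series)}
```

Only `item_a` survives the filter in this toy example.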

2. Separate Fit and Test Periods 

Following the procedure described in Makridakis et al. (1998), each item’s demand 
is broken into two pieces. The first is called the fit period and the second is called the test period. 
The first set of data corresponds to the first 12 periods and is used to initiate the 
forecast methods. The second is formed by the demand in the subsequent eight periods 
and is used to test the difference between the forecast generated and the actual demand. 
Figure 13 presents the demand and the two periods of a sample item in visual form. 


Figure 13. Fit and Test Periods 

[Line chart for NIIN 14646078. Horizontal axis: quarterly periods 1 through 20.] 

The blue curve shows the demand of item NIIN 01-464-6078. The dashed line in red is the 
break point of fit and test periods. Forecasts are generated from period 13 to 20 in order to 
allow comparisons to the actual demand. 


3. Calculate Forecasts 

Makridakis et al. (1998) define three categories of forecasts: quantitative, 
qualitative and unpredictable. All quantitative methods assume that the identified pattern 
of past demand is expected to hold in the future. Additionally, time series is the name of a 
family of forecast methods existing in the quantitative category. 

Considering the fact that no item in LCI 4 and 5 is expected to generate demand 
shifts, trends or seasonality, we assumed that the demand pattern is stationary. Hence, our 
forecasting model comprises six of the simplest time series forecast methods found in 
the literature. 

We selected two averaging methods, two exponential smoothing methods, a 
combination of methods and the one that the Navy is currently using. We used the same 
taxonomy as Makridakis et al. (1998) to present the methods, as follows: 

a. Simple Average (SA) 

This method averages all available demand data, according to the following 
equation: 

$$ f_{t+1} = \frac{1}{t} \sum_{i=1}^{t} a_i \qquad (3.4) $$

where: 

t = amount of available demand data at the moment that the forecast is generated. 

Hence, as the variable t increases, the amount of available demand points also 
increases, making the SA consider more data. 

b. Moving Average (MA) 

As opposed to what happens in SA, this method averages a fixed amount of the 
most recent demand data. This fixed number of observations is called the order 
of average. The MA equation follows: 

$$ f_{t+1} = \frac{1}{k} \sum_{i=t-k+1}^{t} a_i \qquad (3.5) $$

where: 

k = order of average 

The smaller the order of average, the more responsive the method becomes to peaks and 
shifts in demand. For this research, we used a MA of order 12, the exact size of 
the fit period, as a means to keep the method smooth. 
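
The two averaging methods, Equations (3.4) and (3.5), can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes that forecasts for the test period are generated one step ahead, with each forecast using all actual demand observed up to the preceding period.

```python
def simple_average(history):
    """SA: average of all demand observed so far (Equation 3.4)."""
    return sum(history) / len(history)


def moving_average(history, k=12):
    """MA: average of the k most recent observations (Equation 3.5)."""
    window = history[-k:]
    return sum(window) / len(window)


def rolling_forecasts(demand, method, fit_len=12):
    """One-step-ahead forecasts for periods fit_len+1 .. len(demand).

    Assumption: the forecast for period t+1 uses actual demand through period t.
    """
    return [method(demand[:t]) for t in range(fit_len, len(demand))]


demand = list(range(1, 21))          # hypothetical 20-quarter series
sa = rolling_forecasts(demand, simple_average)
ma = rolling_forecasts(demand, moving_average)
```

With 12 fit periods and an order of 12, the first SA and MA forecasts coincide; they diverge afterwards because SA keeps accumulating history while MA drops the oldest observation.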

c. Single Exponential Smoothing (SES) 

In this method, the forecast is a function of the immediate past forecast, adjusted 
by the last forecast error. 

$$ f_{t+1} = f_t + \alpha (e_t) \qquad (3.6) $$

where: 

α = smoothing factor. It is a chosen fixed value between zero and one; 

e_t = forecast error, Equation (2.1) 

The forecast error is used to correct the past forecast value in the opposite 
direction when calculating the next forecast. Hence, α plays the important role of 
weighting the importance of the last forecast error. Higher values of α increase the impact 
of the last forecast error on the next forecast. As α increases, the method 
becomes more responsive, or less smooth. The opposite condition also holds, as lower α 
values imply a smoother method. Hence, an α value can be calculated to optimize the 
results in a specific accuracy metric. However, once the value is found, it is used as a 
constant over time, thus disregarding any possible change in demand pattern. 

Finally, this method requires two parameters to be set before initiating the 
calculations. The first is α and the second is the f_1 value, from which all the subsequent 
forecast values and forecast errors are generated and adjusted. Although we acknowledge 
the possibility of finding optimal values of the two parameters, our forecast model fixes 
α = 0.1 and f_1 = a_1. 
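
A minimal sketch of the SES recursion in Equation (3.6) follows. The function name is hypothetical, and the forecast error is assumed to be actual demand minus forecast (Equation (2.1) is not reproduced in this chapter). The initial forecast defaults to the first actual demand.

```python
def ses_forecasts(demand, alpha=0.1, f1=None):
    """Single exponential smoothing.

    Returns a list where element i is the forecast for period i+1, so the
    last element is the forecast for the period after the final observation.
    Assumption: forecast error e_t = a_t - f_t.
    """
    f = demand[0] if f1 is None else f1   # initial forecast f_1
    forecasts = [f]
    for a in demand:
        f = f + alpha * (a - f)           # f_{t+1} = f_t + alpha * e_t
        forecasts.append(f)
    return forecasts


example = ses_forecasts([10, 20], alpha=0.5)   # illustrative values only
```

A larger `alpha` makes each new forecast move further toward the latest actual demand, which is the responsiveness described above.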

d. Adaptive-Response-Rate Single Exponential Smoothing (ARRSES) 

This method adds the idea of a flexible α to the SES method. Therefore: 

$$ f_{t+1} = \alpha_t a_t + (1 - \alpha_t) f_t \qquad (3.7) $$

where: 

$$ \alpha_{t+1} = \left| \frac{A_t}{M_t} \right| $$

$$ A_t = \beta e_t + (1 - \beta) A_{t-1} $$

$$ M_t = \beta |e_t| + (1 - \beta) M_{t-1} $$

β is a constant value between zero and one and relates to the degree to which α 
values are allowed to vary over time. The initialization of ARRSES comprises a 
bigger set of fixed parameters, as opposed to the SES, which needs only f_1 and α values. 
Our forecast model considers the same parameters used by Makridakis et al. (1993): 

f_2 = a_1; 

α_2 = α_3 = α_4 = 0.2; 

β = 0.12; 

A_1 = M_1 = 0 
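
The adaptive recursion can be sketched as below, following the standard ARRSES formulation in Makridakis et al. This is a hedged illustration: the function name is hypothetical, the forecast error is assumed to be actual minus forecast, the default β is the value listed in the text, and the fallback to the initial α when M_t is zero is an assumption added to avoid division by zero.

```python
def arrses_forecasts(demand, beta=0.12, alpha_init=0.2):
    """Adaptive-response-rate SES.

    Returns a dict mapping period index (starting at 2) to its forecast.
    Initialization: f_2 = a_1, alpha_2 = alpha_3 = alpha_4 = alpha_init,
    A_1 = M_1 = 0.
    """
    n = len(demand)
    f = {2: demand[0]}
    alpha = {2: alpha_init, 3: alpha_init, 4: alpha_init}
    A = M = 0.0
    for t in range(2, n + 1):
        a_t = demand[t - 1]
        e = a_t - f[t]                          # assumed error definition
        A = beta * e + (1 - beta) * A           # smoothed error
        M = beta * abs(e) + (1 - beta) * M      # smoothed absolute error
        if t + 1 not in alpha:
            # alpha_{t+1} = |A_t / M_t|; fall back when M_t is zero (assumption)
            alpha[t + 1] = abs(A / M) if M else alpha_init
        f[t + 1] = alpha[t] * a_t + (1 - alpha[t]) * f[t]
    return f


step = arrses_forecasts([10, 20])
flat = arrses_forecasts([10] * 6)
```

Because |A_t| can never exceed M_t, the adaptive α stays between zero and one without any explicit clipping.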


e. Combination 


As mentioned in the literature review, there is an expected gain in applying a 
combination of forecast methods, when all of them individually generate poor results. 
Hence, this method is just a simple average of forecast values obtained by the other four 
methods presented thus far. The corresponding equation is: 

$$ f_t = \frac{1}{m} \sum_{x=1}^{m} f_{t,x} \qquad (3.8) $$

where: 

x = method index 

f_{t,x} = forecast generated by the corresponding method for the index x, at time t 

m = number of methods to be combined 

Therefore, our forecast model applies indexes from one to four to the previous 
methods, resulting in the use of m = 4. 
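
Equation (3.8) amounts to a period-by-period average across methods. A minimal sketch, with a hypothetical function name and made-up forecast values:

```python
def combine(forecasts_by_method):
    """Average the forecasts of several methods, period by period (Equation 3.8)."""
    m = len(forecasts_by_method)
    return [sum(values) / m for values in zip(*forecasts_by_method.values())]


# Hypothetical forecasts for two test periods from the four preceding methods.
combined = combine({
    "SA": [1, 2],
    "MA": [3, 4],
    "SES": [5, 6],
    "ARRSES": [7, 8],
})
```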

f. Exponential Smoothing with Backcasting 

This method is a variation of SES, in which the initialization value of f_1 is 
obtained by applying the inverse process of forecasting. This particular way to initiate the 
SES was studied and recommended by Ledolter and Abraham (1984) and is currently 
used by the Navy’s ERP. Hereafter, we will refer to this variation of SES as the NAVY 
method. 

A short description of the NAVY method follows: first, the condition f_t = a_t 
is applied, meaning that the most recent forecast value equals the most recent actual 
demand. Then, a fixed α value is applied to obtain backcast values for periods starting 
from t - 1 toward t = 1, as opposed to the generation of a forecast value, which is 
calculated for the period t + 1. Our model applies the same smoothing factor as used by 
the Navy’s ERP. 

The process of generating backcasts continues until the f_1 value is obtained. 
Thereafter, a regular SES forecast method can be initiated. 
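
One way to read the backcasting description is sketched below. This is an interpretation under stated assumptions, not the ERP's actual algorithm: the function names are hypothetical, the backcast is assumed to be an SES recursion run over the series in reverse time order, and the smoothing factor is left as a required argument.

```python
def backcast_initial(demand, alpha):
    """Run SES backwards in time to obtain an initial value f_1.

    Starts from the condition that the most recent forecast equals the most
    recent actual demand, then steps back one period at a time.
    """
    b = demand[-1]                       # f_t = a_t for the latest period
    for a in reversed(demand[1:]):       # periods t, t-1, ..., 2
        b = b + alpha * (a - b)          # backcast for the preceding period
    return b                             # backcast value for period 1


def navy_forecasts(demand, alpha):
    """Regular SES initialized with the backcast f_1."""
    f = backcast_initial(demand, alpha)
    forecasts = [f]
    for a in demand:
        f = f + alpha * (a - f)
        forecasts.append(f)
    return forecasts


f1 = backcast_initial([10, 20, 30], alpha=0.5)      # illustrative values only
path = navy_forecasts([10, 20, 30], alpha=0.5)
```

The design choice here is that the backcast only changes the starting point; once f_1 is known, the forward pass is identical to ordinary SES.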

4. Measure Accuracy at the Item Level 

After calculating all the different forecasts for the test period, some process has to 
take place to identify the most accurate method. The following chart aims to show the 
different forecasts generated and how difficult it can be to rank the methods by accuracy. 

Figure 14. Sample of Forecast Generation 

[Line chart. Legend: Actual Demand, MA, SA, SES, ARRSES, COMB, NAVY.] 

The vertical axis shows the demand sizes. The colored curves show the different forecasts 
generated for the same item presented in Figure 13. It also shows that the differences in 
accuracy, among the methods, are not always visually identifiable. 

In order to utilize a quantitative approach for the selection of the best forecast 
method for a specific item, we applied a pool of four accuracy metrics. All the accuracy 
metrics used in this analysis were discussed in detail in Chapters II and III. 

First, we selected MAE and MSE, respectively Equations (2.4) and (2.2), as they 
are reported to be commonly used in real situations and can generate valid results when 
actual demands are zero. The weakness of generating numbers with units does not harm 
the result’s quality at the item level. Additionally, we selected CIMIPi* and MASE, 
respectively Equations (3.3) and (2.23), because the first is currently used by DOD to 
assess the component’s performance, and the second is the alternative metric presented in 
Chapter III, while making the comparative analysis. 
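
Three of the four metrics can be sketched compactly. The function names are hypothetical, and the MASE scaling shown here, the mean absolute error of the one-step naive forecast over the fit period, is the common definition from the forecasting literature and is assumed rather than taken from Equation (2.23). CIMIPi* is omitted because Equation (3.3) is not reproduced in this chapter.

```python
def mae(actual, forecast):
    """Mean absolute error over the test period."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mse(actual, forecast):
    """Mean squared error over the test period."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def mase(actual, forecast, fit):
    """MAE scaled by the in-sample MAE of the naive (previous value) forecast.

    Assumption: scaling uses the fit period; undefined when the fit period
    is constant, since the naive error is then zero.
    """
    naive_mae = sum(abs(fit[i] - fit[i - 1]) for i in range(1, len(fit))) / (len(fit) - 1)
    return mae(actual, forecast) / naive_mae


# Hypothetical values for illustration only.
fit_period = [1, 2, 3, 4]
test_actual = [5, 6]
test_forecast = [4, 8]
scores = (mae(test_actual, test_forecast),
          mse(test_actual, test_forecast),
          mase(test_actual, test_forecast, fit_period))
```

A MASE below one indicates that the method outperformed the naive forecast on that item.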

Table 24 summarizes the results of four forecast accuracy measurements for 
each one of the six forecast methods applied to a randomly selected sample item from the 
dataset. 

Table 24. Summary of Accuracy Results 

NIIN: 14646078 

Demand Description 
Mean   1837.75 
STD    196.8079146 
CV     0.107091778 

Forecast Methods 

Metric   Simple average   Moving average   SES        ARRSES     Combination   NAVY 
MAE      174.59           171.72           182.20     218.27     185.01        195.23 
MSE      36796.09         36912.69         37129.76   57422.29   40122.25      43940.14 
MASE     0.79             0.78             0.83       0.99       0.84          0.89 
CIMIP    0.90             0.91             0.90       0.88       0.90          0.89 

Highlighted in yellow are the accuracy metrics’ choices of most accurate forecast 
methods. 


5. Rank the Forecast Methods by Accuracy Metric 

In order to identify the best and worst forecast method for any particular item, we 
generated rankings for each one of the accuracy metrics. MAE, MSE and MASE results 
are considered better when values are low. On the other hand, CIMIPi* results are 
considered better as the values are high. 

The following table considers the results presented in Table 24 to form the 
rankings within each one of the accuracy metrics used. 

Table 25. Ranking of Forecast Methods by Accuracy Metric 

Metric   Simple average   Moving average   SES   ARRSES   Combination   NAVY 
MAE      2                1                3     6        4             5 
MSE      1                2                3     6        4             5 
MASE     2                1                3     6        4             5 
CIMIP    2                1                3     6        4             5 

For this particular item, using MAE as the selected accuracy metric, Moving Average is 
the forecast method that is expected to minimize the errors between forecast values and 
actual demand. 
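
The ranking step can be sketched with the MAE and MSE values from Table 24. The helper name `rank_methods` is hypothetical, and ties are not handled specially here (they are broken by input order), which is why the rounded CIMIP values are left out of the example.

```python
def rank_methods(scores, higher_is_better=False):
    """Rank methods from 1 (best) to n (worst) for one accuracy metric."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {method: position for position, method in enumerate(ordered, start=1)}


# Values taken from Table 24 for the sample item.
mae_scores = {"SA": 174.59, "MA": 171.72, "SES": 182.20,
              "ARRSES": 218.27, "COMB": 185.01, "NAVY": 195.23}
mse_scores = {"SA": 36796.09, "MA": 36912.69, "SES": 37129.76,
              "ARRSES": 57422.29, "COMB": 40122.25, "NAVY": 43940.14}

mae_ranks = rank_methods(mae_scores)   # lower is better
mse_ranks = rank_methods(mse_scores)   # lower is better
```

For a metric where higher values are better, such as CIMIPi*, the same helper would be called with `higher_is_better=True`.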


6. Count of Best Ranks 

This analysis aims to investigate the skewness of the distribution of best ranks, 
considering the underlying methodological differences of the four accuracy metrics 
mentioned. In other words, we test whether a particular forecast method is considered the most 
accurate for the majority of items contained in the trimmed data. 

7. Generate Overall Accuracy Ranking at the Item Level 

We consider that the most accurate method to forecast demand, for the specific 
item considered, is the one that generates the lowest median of ranks, as shown in Table 
26 and Table 27. 

Table 26. Overall Ranks 

               SA   MA   SES   ARRSES   COMB   NAVY 
Overall rank   2    1    3     6        4      5 

Table 27. Best and Worst Forecast Methods 

Best method    MA 
Worst method   ARRSES 

Considering all four accuracy metrics’ results, Moving Average is considered the most 
accurate forecast method for this item, as it generates the lowest overall rank. 
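
The median-of-ranks rule can be illustrated with the ranks from Table 25. This is a sketch with hypothetical variable names; how ties between medians are resolved is not stated in the text, so the example simply relies on sort order.

```python
from statistics import median

# Ranks from Table 25, in the order MAE, MSE, MASE, CIMIP.
ranks = {
    "SA": [2, 1, 2, 2],
    "MA": [1, 2, 1, 1],
    "SES": [3, 3, 3, 3],
    "ARRSES": [6, 6, 6, 6],
    "COMB": [4, 4, 4, 4],
    "NAVY": [5, 5, 5, 5],
}

medians = {method: median(r) for method, r in ranks.items()}
ordered = sorted(medians, key=medians.get)            # lowest median first
overall = {method: pos for pos, method in enumerate(ordered, start=1)}
best, worst = ordered[0], ordered[-1]
```

The result reproduces Table 26 and Table 27 for the sample item.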

8. Build Clusters 

In order to allow the investigation of the possibility of one forecast method to be 
capable of outperforming all the others for a specific group of items, we created 11 
clusters of items, each one of those corresponding to a specific range of coefficients of 
variation (CV). Hendricks and Robey (1936) explain the coefficient of variation as the 
ratio of the standard deviation of a number of measurements to their arithmetic mean. 
This ratio provides a standard for overall variability assessment since the number is scale 
free, and can be used to compare datasets. 
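
A minimal sketch of the CV calculation and cluster assignment follows. Several details are assumptions: the function names are hypothetical, the population standard deviation is used, and the cluster width of 0.4 is inferred only from the "0.0-0.4" cluster label that appears later in the chapter; the actual cluster boundaries are not listed in the text.

```python
from statistics import mean, pstdev


def coefficient_of_variation(demand):
    """Ratio of the standard deviation to the arithmetic mean (scale free)."""
    return pstdev(demand) / mean(demand)


def cluster_index(cv, width=0.4, n_clusters=11):
    """Map a CV value to one of n_clusters equal-width bins (assumed layout).

    Values beyond the last boundary are placed in the final cluster.
    """
    return min(int(cv // width), n_clusters - 1)


# Mean and STD of the sample item reported in Table 24.
sample_cv = 196.8079146 / 1837.75
sample_cluster = cluster_index(sample_cv)
```

The sample item's CV of roughly 0.107 places it in the lowest-variability cluster under these assumptions.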

The following histogram shows the CV clusters, along with the number of items 
contained in each. 

Figure 15. Histogram of Coefficient of Variation 

9. Generate Rankings on Clusters of Coefficient of Variation 

In order to elect the best forecast method for a specific cluster of coefficient of 
variation, we counted the number of items in which each of the forecast methods was 
considered the best and the worst option. The sample chart below shows the rank 
results stored in a given cluster of CV. 

Figure 16. The Best and Worst Methods Within a Cluster 

[Bar chart. Legend: Best Method, Worst Method. Horizontal axis: SA, MA, SES, ARRSES, 
COMB, NAVY.] 

Results collected in the CV cluster of 0.0-0.4. The vertical axis represents the number of 
items, while the horizontal axis shows the forecast methods. In this case, ARRSES is the 
method most frequently considered both best and worst. That information provides an idea 
of the risk involved in the decision of selecting a specific forecast method. 

10. Generate MASE Scores of Clusters 

As a different approach to the use of ranks to track the performance of forecast 
methods, we calculated the average, minimum and maximum MASE values within each 
cluster of coefficient of variation. The intention is to identify a pattern of relative 
performance as the CV increases, compared to what the naive method produces. Moreover, 
those three values of MASE, measured over time, provide the range of possible 
results to inform about the existing risk of choosing that specific method for the entire 
population. 

11. Assess the Relative Performance of Navy’s Forecast Method 

We used the MASE accuracy metric in order to measure the potential gain of 
implementing different forecasting methods, instead of the Navy’s status quo. First, we 
counted the percentage of items in which the NAVY method is not the best, meaning that 
there is opportunity to increase accuracy by using another forecast method. 

Additionally, we counted the percentage of times that the Navy’s forecast method 
performed worse than naive, which means MASE values higher than one. Then, out of 
that, we counted how many times another method was capable of outperforming the 
naive. 
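
The three counts described above can be sketched as follows. The function name, dictionary layout, and sample MASE values are hypothetical and only illustrate the counting logic.

```python
def navy_summary(mase_by_item):
    """Summarize NAVY performance from per-item MASE scores.

    mase_by_item maps an item identifier to a dict of {method: MASE}.
    """
    n = len(mase_by_item)
    not_best = sum(1 for s in mase_by_item.values()
                   if min(s, key=s.get) != "NAVY")
    worse_than_naive = [s for s in mase_by_item.values() if s["NAVY"] > 1.0]
    rescued = sum(1 for s in worse_than_naive
                  if any(v < 1.0 for m, v in s.items() if m != "NAVY"))
    return {
        "pct_not_best": 100 * not_best / n,
        "pct_worse_than_naive": 100 * len(worse_than_naive) / n,
        "rescued_count": rescued,   # another method beat naive where NAVY did not
    }


# Hypothetical MASE values for four items and two methods.
summary = navy_summary({
    "item_1": {"NAVY": 0.8, "SA": 0.9},
    "item_2": {"NAVY": 1.2, "SA": 0.7},
    "item_3": {"NAVY": 1.5, "SA": 1.6},
    "item_4": {"NAVY": 0.9, "SA": 0.6},
})
```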


12. Measure the Level of Agreement between MASE and CIMIPi* 

In order to complement the comparative analysis conducted in Chapter III, we 
measured the number of times that rank results of MASE and CIMIPi* agree. The idea is 
to provide the magnitude of the existing theoretical difference between the metrics, using 
real data. 

E. RESULTS 

The model described is used to calculate forecast values, along with the respective 
accuracy scores as a means to identify the method that minimizes the expected error in 
each item. This section presents results grouped into two categories: accuracy metrics 
and forecast methods. The first category utilizes real data to complement the theoretical 
comparative analysis between the CIMIP and MASE accuracy metrics, conducted in Chapter 
III. The second utilizes accuracy measurements as a tool to help forecasters in the task of 
optimizing the selection of a forecast method. 

1. Accuracy Metrics 

As mentioned, there are expected qualitative gains in choosing MASE as a 
substitute for the CIMIP metric. As the comparative analysis used small sets of hypothetical 
items to demonstrate some characteristics of the metrics, a relevant question remained: do 
the results generated by the new metric represent a significant improvement? 

In order to answer that question, we have to consider that the current procedures 
do not formally involve the use of accuracy measures at the individual level. Components 
are just required to generate aggregated accuracy values to report to DOD as a 
representation of the overall forecast performance. 

The Navy has tried to implement CIMIPi and LASE, respectively Equations (3.2) 
and (2.26), at the individual level as an internal effort to identify items that represent 
significant sources of inaccuracy within the big data. The vulnerabilities of those metrics 
were exposed in Chapter II, while Chapter III concluded that both MASE and CIMIPi* 
metrics, respectively Equations (2.23) and (3.3), can be used at the individual level. 

However, the model presented in this chapter has a higher ambition on the use of 
accuracy metrics at the individual level. Assuming the generation of multiple forecasts 
per item in a given time, accuracy values can be used as inputs to support the decision of 
selecting the best forecast method. 

Figure 17 shows the agreement level between MASE and CIMIPi*, both with each 
other and with the overall rank generated. The agreement level is defined as 
the percentage of times, considering all items, in which the results of two accuracy 
metrics lead to the same conclusion. This analysis uses ranks as the criterion to set a 
common ground for comparison among the accuracy metrics. 
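
The agreement level can be sketched as two percentages: agreement on the complete ranking and agreement on the top-ranked method only. The function name and the sample rankings are hypothetical; each ranking is assumed to be a list of methods ordered from best to worst.

```python
def agreement_levels(ranking_a, ranking_b):
    """Percentage of items on which two metrics agree.

    ranking_a and ranking_b map an item identifier to a list of methods
    ordered from best to worst under each metric.
    """
    n = len(ranking_a)
    full = sum(1 for item in ranking_a if ranking_a[item] == ranking_b[item])
    best = sum(1 for item in ranking_a if ranking_a[item][0] == ranking_b[item][0])
    return 100 * full / n, 100 * best / n


# Hypothetical rankings for four items and three methods.
by_mase = {"i1": ["MA", "SA", "SES"], "i2": ["SA", "MA", "SES"],
           "i3": ["SES", "SA", "MA"], "i4": ["MA", "SES", "SA"]}
by_cimip = {"i1": ["MA", "SA", "SES"], "i2": ["SA", "SES", "MA"],
            "i3": ["SA", "SES", "MA"], "i4": ["MA", "SES", "SA"]}

full_pct, best_pct = agreement_levels(by_mase, by_cimip)
```

Agreement on the best method is always at least as high as agreement on the full ranking, since identical rankings necessarily share the same first entry.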

Figure 17. MASE and CIMIPi* Agreement 

[Bar chart. Legend: Agree, Disagree. Bars: All MASE and CIMIP ranks; MASE and CIMIP on 
the best method.] 

The first bar on the left represents the percentage of items for which MASE and CIMIPi* 
results led to the exact same ranks for all six forecast methods used in the model. The 
second bar measures the agreement level on electing the most accurate forecast method. 

When forming complete rankings of forecast methods, the methodological 
difference between MASE and CIMIPi* led to a significant divergence of 25%. 
However, the main objective of the whole model is to provide useful information to 
optimize the selection of the most accurate forecast method for each item. For that matter, 
there is a high agreement level of 94% between the two accuracy metrics. 

2. Forecast Methods (Time Series) 

We acknowledge the fact that parameters used to generate accurate forecasts in 
the past do not guarantee high performance in the future. However, based on the 
assumption of demand stationarity, we expect that the selection of the most accurate 
method in past data can result in improvements on future forecast performance. 

This section aims to uncover the existence of patterns that could be used to form a 
decision rule on the selection of the best forecast method. The tests were conducted under 
two main methods: analysis of ranks and MASE results analysis. 

a. Analysis of Ranks 

(1) Whole Population of Items 

Considering the completely trimmed data, we first count the number of items in 
which the forecast methods were considered the most accurate, by each accuracy metric. 
Results follow: 

Figure 18. Count of Best Ranks by Accuracy Metric 

[Bar chart. Horizontal axis: SA, MA, SES, ARRSES, COMB, NAVY. Legend: MAE, MSE, MASE, 
CIMIPi*.] 

The vertical axis represents the number of items. 

There is no clear evidence, in the trimmed data, that one forecast method is 
mostly considered the best option. While MSE results are the most skewed toward SA, the 
other three accuracy metrics are slightly skewed toward ARRSES. 

In order to enable a clear visualization of the overall skewness of ranks, among 
the forecast methods, we consolidated the counts of the four accuracy metrics. Results are 
shown in Figure 19. 

Figure 19. Consolidated Percentages of Best Ranks 

We found that there is no clear evidence, in the trimmed data used, to support that 
a particular forecast method is capable of outperforming the others in a large majority of 
items. Hence, further analyses are needed to help in the decision of selecting the most 
accurate forecast method. 

(2) Clusters of Coefficient of Variation 

Rather than trying to identify the most accurate forecast method for the whole 
population of items, the next analysis investigates the benefits of choosing a specific 
forecast method in groups of items that have similar demand behaviors, in terms of 
amount of variability. Hence, the following analysis applies a rank analysis, utilizing 
clusters of CV to group items. Results follow. 

Figure 20. Best and Worst Forecast Methods by Cluster 

The relation between the blue and the red bars provides an idea of the risk involved in the 
choice of one fixed method to be used in a whole cluster of items. 


There is a pattern, across the clusters, of high risk in selecting one forecast method 
to be applied to the whole group of items. As an example, ARRSES was most often elected 
the best method, all clusters combined. At the same time, it was considered the worst option 
more times than all others. Hence, we can state that there is a significant risk of 
inaccuracy in choosing one method to be used in a cluster of CV. 

Another relevant investigation is about the potential existence of upward or 
downward trends in forecast method ranks, as CV increases. Figure 21 shows how 
each forecast method is ranked in the clusters of CV, based only on the number of times 
it was considered the best option. 


Figure 21. Average Rank Variation by Clusters 



The vertical axes represent the aggregated rank, which is related to the number of times 
one method was considered the best option within each cluster. Trend lines are in black. 


The Combination method shows a constant worst rank in all clusters of CV, but that 
does not mean that it is the absolute worst method. What it does mean is that it is not 
often the best method, not considering the insignificant number of items in which it was 
considered the worst method. Additionally, trend lines help to explain a significant 
amount of variance in results of two methods. ARRSES tends to lose rank as CV 
increases, though not uniformly, while the NAVY method, also not uniformly, tends to gain 
rank as variability increases. 

The drawback of the analysis of ranks is that it does not provide an accurate sense 
of differentiation between methods. As shown in Figure 20, differences in counts of best 
rank among the methods are sometimes significant and sometimes clearly irrelevant. 
Therefore, analysis of ranks may distort the existing accuracy difference between the methods. 

b. Analysis of MASE Results 

In this section we analyze MASE results collected in the test period to select the 
forecast method to be used thereafter. The first analysis is set to investigate whether 
forecast methods behave differently as the coefficient of variation increases, in order to 
indicate the use of one for items with less variable demands and another for items with 
more variable demands. 

Figure 22 shows how MASE minimum, maximum and average values of each 
forecast method change as CV increases. 

Figure 22. MASE Values per Forecast Method 

All the charts utilize exponential vertical axes to capture the entire spectrum of possible 
results. The horizontal axes correspond to the clusters of coefficient of variation. 
Maximum and minimum values provide the idea of the risk involved in the selection of the 
method as a fixed solution. 


All forecast methods considered in the model generate similar shapes of 
maximum and average curves. However, the ARRSES and NAVY methods are capable of 
generating the lowest minimum values, thus spreading the range of possible values by 
allowing significantly accurate forecasts at high values of CV. 

The similarity of the accuracy curves’ shapes shows that there is no evidence that the 
selection of forecast method according to low or high variability will result in any 
accuracy improvement. That similarity is partially explained by the fact that the forecast 
methods used in the model are classified as quantitative and time series. Hence, they are 
all based on the same assumption of demand stationarity, as they use historical data to 
predict future values. Furthermore, time series forecast methods can be considered 
responsive or smooth, depending on the parameters used. SA is a smooth method by 
nature, while the k, α and β values used respectively in MA, SES and ARRSES made 
them behave as smooth methods as well. The Combination method can also be considered 
smooth as it averages the forecasts of the previous four methods. The NAVY method is the most 
responsive in the model, as it uses α = 0.3. 

Figure 23 shows the six average MASE curves together, corresponding to the
forecast methods applied in the model, to highlight the similarity in terms of forecast
accuracy values.

Figure 23. Average MASE Results by Forecast Method

The vertical axis shows MASE values and was intentionally cut at 2.0, as the
values continue to increase and values higher than 1.0 are considered worse than the naive
method. For low CV values, accuracy results are similar, but they tend to diverge as CV
increases.

In order to optimize the quality of forecast results, we can apply an average
MASE value of 1.0 as a threshold: the use of a specific forecast method is
recommendable when its average MASE is below 1.0, because it is then capable of
outperforming the naive method systematically. For items with higher values of CV,
closer attention is needed to support the forecasting process.
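
As a minimal sketch of how the 1.0 threshold can be read, the Python function below computes MASE for one item. It assumes the common formulation in which the mean absolute forecast error over the test period is divided by the in-sample mean absolute error of the one-step naive forecast; the exact variant and period split used in the model may differ, and the demand values are hypothetical.

```python
from statistics import mean


def mase(history, actuals, forecasts):
    """Mean absolute scaled error for a single item.

    Assumed formulation: test-period MAE divided by the in-sample MAE of the
    one-step naive forecast (each period predicted by the previous one).
    """
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    naive_mae = mean(abs(curr - prev) for prev, curr in zip(history, history[1:]))
    if naive_mae == 0:
        raise ValueError("constant history: the naive benchmark error is zero")
    forecast_mae = mean(abs(a - f) for a, f in zip(actuals, forecasts))
    return forecast_mae / naive_mae


# Hypothetical quarterly demand for one item.
history = [10, 12, 9, 11, 13, 10]
actuals = [12, 9, 11, 10]
forecasts = [11, 11, 11, 11]

score = mase(history, actuals, forecasts)
print(f"MASE = {score:.3f}")
print("outperforms naive" if score < 1.0 else "no better than naive")
```

A value below 1.0 means the method's average error is smaller than that of the naive benchmark on the same item, which is the reading applied to the threshold above.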

Applying that threshold, we found that none of the forecast methods used in the
model systematically outperforms the naive method for CV values higher than
1.6, while all of them can outperform, on average, the naive method for CV values lower
than 1.6. Hereafter, we will refer to the range of 0 < CV < 1.6 as the “selected data”.
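
The threshold rule can be sketched in Python as follows. The cluster edges, the item values and the function names are hypothetical, chosen only to illustrate grouping items by CV, averaging MASE per cluster and keeping the clusters whose average is below 1.0.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import mean

# Hypothetical upper bounds of the CV clusters; the last cluster is open-ended.
EDGES = [0.4, 0.8, 1.2, 1.6, 2.0]


def cluster_index(cv):
    """Index of the first cluster whose upper bound is >= cv."""
    return bisect_left(EDGES, cv)


def cluster_label(index):
    lower = 0.0 if index == 0 else EDGES[index - 1]
    if index == len(EDGES):
        return f"> {lower}"
    return f"{lower}-{EDGES[index]}"


def average_mase_by_cluster(items):
    """items: iterable of (cv, mase) pairs, one pair per item."""
    groups = defaultdict(list)
    for cv, mase_value in items:
        groups[cluster_index(cv)].append(mase_value)
    return {index: mean(values) for index, values in sorted(groups.items())}


def recommended_clusters(items, threshold=1.0):
    """Clusters in which the method beats the naive benchmark on average."""
    averages = average_mase_by_cluster(items)
    return [index for index, avg in averages.items() if avg < threshold]


# Hypothetical (CV, MASE) results for one forecast method.
items = [(0.3, 0.6), (0.35, 0.8), (1.5, 0.9), (1.55, 0.95), (1.9, 1.2), (2.5, 1.8)]
for index in recommended_clusters(items):
    print("recommended for CV cluster", cluster_label(index))
```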

Figure 24 shows the same results as Figure 23, but on a different scale, as its
MASE values are limited to 1.0.

Figure 24. Average MASE Results in the Selected Data

Within the range of CV in which the forecast methods can be used to outperform the naive
benchmark, the NAVY method is systematically the best option.

Although the NAVY method had better performance in all clusters of CV in the
selected data, we identified a risk in using a fixed forecast method for a group of items.
Hence, we investigated the potential accuracy benefit of selecting the most accurate forecast
method for each item, which we call the Flexible Method, instead of working
with a fixed method.

Figure 25 shows that the adoption of the Flexible Method in the selected data
resulted in a significant gain in accuracy when compared with each of the forecast
methods applied individually.

Figure 25. Accuracy Gain of Flexible Method

[Bar chart of average MASE by method: SA, MA, SES, ARRSES, COMB, NAVY, Flex Model; the only legible data label is 0.807.]

The bars represent the average of MASE results for items with CV < 1.6.
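
The comparison behind the figure can be illustrated with a short Python sketch. The item identifiers and MASE values below are hypothetical and only two of the six methods are shown; the point is that taking the lowest MASE per item can never produce a higher average than any single fixed method.

```python
from statistics import mean


def flexible_selection(mase_by_item):
    """For each item, pick the method with the lowest MASE."""
    return {item: min(scores, key=scores.get) for item, scores in mase_by_item.items()}


def fixed_average(mase_by_item, method):
    """Average MASE when one method is used for every item."""
    return mean(scores[method] for scores in mase_by_item.values())


def flexible_average(mase_by_item):
    """Average MASE when the best method is used for each item."""
    return mean(min(scores.values()) for scores in mase_by_item.values())


# Hypothetical per-item MASE results.
mase_by_item = {
    "item-1": {"SA": 0.9, "NAVY": 0.7},
    "item-2": {"SA": 0.6, "NAVY": 0.8},
    "item-3": {"SA": 1.1, "NAVY": 0.9},
}

print(flexible_selection(mase_by_item))
print("NAVY fixed:", round(fixed_average(mase_by_item, "NAVY"), 3))
print("Flexible:  ", round(flexible_average(mase_by_item), 3))
```

Because the best method is chosen using the same MASE values that are then averaged, a figure computed this way is best read as a ceiling on the gain that a selection rule applied in advance could deliver.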

Additionally, Figure 26 shows that the Flexible Method not only has superior
accuracy to the NAVY method, which was considered the most accurate among the six
methods applied in the selected data, but is also capable of extending the range of CV in
which it can be used to systematically outperform the naive benchmark.

Figure 26. MASE Values of NAVY and Flexible Method

The Flexible Method resulted in significantly superior accuracy in all clusters of CV in the
selected data. Additionally, it generates an average MASE < 1 for the CV cluster (1.6-2.0),
which extends the overall range of CV values in which the use of time series forecast
methods is expected to outperform the naive method.


Therefore, considering the data used, the adoption of the Flexible Method represented
a significant gain in forecast accuracy as well as an increase in the number of items for which
time series forecasts were considered recommendable. Applying our findings, all six
time series forecast methods, if applied as a fixed solution, were recommendable for
17,437 items, which represents 57.22% of the trimmed data. Meanwhile, using the same
criteria, the Flexible Method is considered recommendable for 22,256 items,
representing 73.04% of the trimmed data.
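
The size of that extension follows directly from the figures quoted above. The short Python calculation below uses only those four numbers; the variable names are illustrative.

```python
fixed_items = 17_437      # items recommendable under a fixed method
flexible_items = 22_256   # items recommendable under the Flexible Method
fixed_share = 57.22       # percent of the trimmed data
flexible_share = 73.04    # percent of the trimmed data

additional_items = flexible_items - fixed_items
share_gain_points = round(flexible_share - fixed_share, 2)
relative_gain_pct = round(100 * additional_items / fixed_items, 1)

print(f"additional items: {additional_items}")
print(f"gain in coverage: {share_gain_points} percentage points")
print(f"relative increase in covered items: {relative_gain_pct}%")
```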

F. CHAPTER SUMMARY 

After applying a model that calculates demand forecasts and accuracy values for
all items’ data, the most significant findings were:

• Despite the methodological differences and theoretical superiority of MASE
over CIMIPf, both generated a very high level of agreement when
selecting the most accurate forecast method;

• The calculation of forecast accuracy can be used by forecasters as a
managerial tool, instead of just fulfilling the need for reporting;

• In order to provide information that helps to improve the forecasting
processes, accuracy has to be calculated at the item level;

• All forecast methods applied in the model tend to be less accurate than the
naive method as CV increases;

• Using averages of MASE values, the NAVY method was considered the most
accurate of all six forecast methods used in the model for all clusters of
CV < 1.6;

• The use of the Flexible Method resulted in a significant gain in accuracy
when compared to any of the other forecasting methods applied
individually.

V. FINDINGS, RECOMMENDATIONS AND FUTURE RESEARCH


A. FINDINGS 

While many of our findings throughout this research are detailed at the point of 
discussion, the major findings of our research in regard to CIMIPf and forecast accuracy 
measurement are summarized below for ease of access. 

1. CIMIPf Weaknesses

CIMIPf is not able to produce accuracy results for individual line items when the 
actual demand for that item during the period, usually one year, is zero. This complicates 
the individual line item assessment of forecast accuracy, since CIMIPf returns an invalid 
division-by-zero result. This weakness does not prevent the aggregation of results for 
multiple line items because of the summation that occurs in the denominator prior to the 
final calculation. 

CIMIPf results are significantly affected by the unit costs that are included in both
the numerator and denominator of the equation. The inclusion of unit cost as an 
independent variable in CIMIPf detracts from the primary purpose of measuring forecast 
accuracy performance. 

CIMIPf produces aggregated results that are not inherently intuitive and are 
disproportionately affected by over-estimations. This is especially evident with low 
demand items where the possibility of the size of the error exceeding actual demand is 
greater. We found that the aggregate CIMIPf for 28,235 low demand items produced a
large negative result (-314%), while the aggregate CIMIPf for 15,690 high demand items
produced a modest positive result (58%). As another example of the effect of unit cost,
due to the high dollar weighting for the high demand group, the total CIMIPf result was
48%.
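
The three behaviors described above can be reproduced with a small Python sketch. The CIMIPf equation is not restated in this section, so the form used here, one minus the dollar-weighted absolute error divided by the dollar-weighted actual demand, is an assumption reconstructed from the properties listed above; all item data are hypothetical.

```python
def cimip_accuracy(items):
    """Assumed form: 1 - sum(|forecast - actual| * cost) / sum(actual * cost).

    items: iterable of (forecast, actual, unit_cost) tuples.
    """
    error_dollars = sum(abs(f - a) * c for f, a, c in items)
    demand_dollars = sum(a * c for f, a, c in items)
    if demand_dollars == 0:
        raise ZeroDivisionError("actual dollar demand is zero; result undefined")
    return 1 - error_dollars / demand_dollars


# Hypothetical line items: (forecast, actual, unit cost).
low_demand = [(4, 0, 10.0), (3, 1, 10.0)]   # cheap, over-forecast items
high_demand = [(110, 100, 1000.0)]          # expensive, well-forecast item

# 1. A single item with zero actual demand has no valid result.
try:
    cimip_accuracy(low_demand[:1])
except ZeroDivisionError as exc:
    print("item level:", exc)

# 2. Aggregation works, but over-forecasts larger than demand go negative.
print("low demand group: ", cimip_accuracy(low_demand))

# 3. Dollar weighting lets the expensive item dominate the total.
print("high demand group:", cimip_accuracy(high_demand))
print("all items:        ", round(cimip_accuracy(low_demand + high_demand), 4))
```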

CIMIPf does not consider the difficulty of accurately forecasting the entirety of
material that the services and DLA are charged with managing. Its lack of a
benchmarking function, similar to the one found in MASE, results in CIMIPf directly
comparing the forecasting performance of the services and DLA against each other.
Although we did not compare the performance of the Navy versus DLA, without
consideration of a performance benchmark, the services could be penalized for what is
considered to be poor performance or incentivized to make risky decisions in an effort to
improve forecasting performance.

2. Forecast Accuracy 

There has been significant study on the topic of forecast accuracy within the
academic world. Among the large number of forecast accuracy metrics currently available
in the literature, MASE was considered useful and theoretically superior to all variants of
CIMIP.

From the perspective of IMs at WSS, the measurement of accuracy at the item
level generates more value than the single aggregated accuracy number currently required
by DOD.

Item accuracy measurements enable a better identification of poorly forecasted 
items and can also be applied as a managerial tool for determining which forecast method 
to utilize. 


3. Demand Forecasting 

The task of demand forecasting within the DOD is very complex because demand
patterns are significantly heterogeneous. Using MASE as the forecast accuracy
measurement, we found that the Navy’s preferred forecasting method, on average,
outperformed the other five methods when compared to the naive method and when CV was
less than 1.6. Additionally, flexibility in the choice of forecasting method at the
individual item level enabled our test data to outperform the naive method when CV was
less than 2.0.

B. RECOMMENDATIONS

1. DOD

The following are recommendations for the DOD to improve demand forecasting:

a. Replace CIMIPf with MASE as the Aggregate Forecast Accuracy
Measurement of Record

As we have shown, MASE is superior to CIMIPf in its ability to provide intuitive
results across more demand patterns, while also avoiding distortions from unit cost and
demand volume. The built-in benchmarking of the MASE equation will also enable the
DOD to more accurately assess the forecasting performance of the services and DLA.

b. Consider the Naive Method as a Basis for Department Benchmarks 

Direct comparison of demand forecasting performance between the services and 
DLA using an absolute error metric, such as CIMIPf, does not consider the difficulty of 
forecasting for the unique materiel populations. A department-wide goal that arbitrarily 
declares a certain accuracy percentage as acceptable does not accurately reflect the 
complexity of the task and has the potential to drive counter-productive behavior in an 
effort to reach the goal. A better measure of demand forecasting performance would 
utilize a benchmarked metric, such as MASE, and then set the standard as outperforming 
the benchmark. In the case of MASE, which uses the naive method as a benchmark, this 
would encourage the services and DLA to attain an aggregate forecast accuracy score 
equal to or less than some number less than one. 

2. Navy

The following are recommendations for the Navy to improve demand forecasting: 

a. Transition to Flexible Forecasting Methods at the Item Level 

As we have shown, the Navy’s current forecasting method of exponential 
smoothing with backcasting outperforms the naive method on average when the CV is 
less than 1.6. If NAVSUP’s forecasters had flexibility in their choice of forecasting 
method, then on average, they would be able to select an analytical forecasting method 
that outperformed the naive method when the CV of an item is less than 2.0. The 
complexity of generating accurate demand forecasts for such a diverse set of items does 
not lend itself to using only one analytical forecasting method. As the ERP program 
improves its capabilities, the Navy would benefit from more flexibility in its forecasting
methods. The ideal approach would be to apply multiple forecast methods to the
historical data of each line item and then choose the forecast method that optimizes the
MASE result, or whichever accuracy metric the Navy utilizes.
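
A minimal Python sketch of that approach is shown below. It is an illustration under stated assumptions, not the Navy's or ERP's implementation: only three simple methods are included, the parameters k = 4 and alpha = 0.1 are hypothetical, forecasts are produced one step ahead over a four-period holdout, and MASE is scaled by the naive error on the training portion.

```python
from statistics import mean


def simple_average(history):
    return mean(history)


def moving_average(history, k=4):
    return mean(history[-k:])


def exponential_smoothing(history, alpha=0.1):
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


# Hypothetical pool of candidate methods.
METHODS = {"SA": simple_average, "MA": moving_average, "SES": exponential_smoothing}


def holdout_mase(series, method, n_test):
    """One-step-ahead MASE over the last n_test periods of the series."""
    train = series[:-n_test]
    naive_mae = mean(abs(curr - prev) for prev, curr in zip(train, train[1:]))
    errors = []
    for t in range(len(series) - n_test, len(series)):
        forecast = method(series[:t])      # uses only data before period t
        errors.append(abs(series[t] - forecast))
    return mean(errors) / naive_mae


def best_method(series, n_test=4):
    """Return the name of the method with the lowest holdout MASE, and all scores."""
    scores = {name: holdout_mase(series, fn, n_test) for name, fn in METHODS.items()}
    return min(scores, key=scores.get), scores


# Hypothetical item with steadily growing quarterly demand.
series = list(range(1, 21))
name, scores = best_method(series)
print("selected method:", name)
print({k: round(v, 2) for k, v in scores.items()})
```

On this trending series the moving average is selected because it lags the least, yet its MASE is still above 1.0, which shows why the selection step should be paired with the naive benchmark check rather than replace it.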

b. Utilize MASE to Analyze Forecast Accuracy at the Item Level 

MASE has advantages over both the CIMIPf and EASE equations and utilizing it 
as a forecast accuracy measurement will enable WSS to better identify specific line items 
that have not been well forecasted over time even when actual demand is zero. 

c. Publish a NAVSUP Demand Forecasting Procedures Instruction

During the course of our research we could not locate a NAVSUP instruction that 
detailed the procedures that WSS shall use to generate demand forecasts for all of the 
various situations and how to measure those results. While there are internal business 
rules and other technical ERP documents, an instruction of this type would ensure a 
broader understanding of demand forecasting across the Navy and open up the process 
for constructive criticism that could lead to improved results. 

C. AREAS FOR FUTURE RESEARCH 

The challenge of accurately forecasting demand across the DOD is not a simple 
matter and the recommendations we have offered here are not likely to solve all of the 
issues that prevent the DOD from improving forecast performance. During the course of 
our research we looked at many segments of this issue that we did not have the 
opportunity to explore further. Some of these ideas may generate constructive 
improvements while others may not. The following are non-mutually exclusive ideas that 
we feel deserve further study in order to improve demand forecasting within the Navy 
and DOD. 

1. Item Manager Discretion to Adjust ERP Derived Forecast 

In our discussions with NAVSUP we learned that after ERP develops demand
forecasts using the exponential smoothing with backcasting method, these forecasts are
subject to IM review and possible adjustment. We feel that it would be worthwhile to
compare the effectiveness of the IM-adjusted forecasts to the original ERP-developed
forecast. A comparison of the actual demand data to the original and adjusted forecasts
should reveal whether the IM-adjusted forecasts are more or less accurate than
the original ERP-derived forecast. The scope of this research could cover all LCIs or
just a specific LCI subset, since NAVSUP uses different forecasting methods to generate
forecasts for each LCI group.

Additionally, surveys of the IMs could determine the leading reasons for
adjusting an ERP-derived forecast. A comparison of these IM-provided reasons with the
actual forecast performance could help determine which reasons generally result in more
accurate forecasts and which generally result in less accurate forecasts. If the human
survey portion is included, the NPS researcher would need to obtain permission from the
human research protection program office and the institutional review board. A study of
this kind would also require the full support of NAVSUP and access to the IMs.

2. Explore the Use of Retail Level Demand in Forecast Development 

To develop demand forecasts, NAVSUP uses quarterly wholesale level demand
over a five-year period. While this data provides a good proxy for aggregated retail
demand and is easier to obtain, it also results in less frequent demand occurrences and
could hide demand patterns. Although retail level demand can be challenging to organize
and interpret, it may provide a better data set to generate demand forecasts. In
multi-echelon supply chains, demand information from the end user level must be tracked in
order to mitigate the negative impacts of the bullwhip effect. When demand variability at
the retail level is combined with a lack of communication up the supply chain, excess
inventory is likely to form at all levels. CIMIP has addressed inventory visibility
challenges, but sharing of end-customer demand information can also help to reduce
unnecessary inventory. We propose an analysis of whether properly trimmed retail level
demand can provide a better demand forecast for items that have traditionally been
difficult to forecast with only wholesale level demand.

3. Explore Alternatives to Managing Material by Life Cycle Indicator

The Navy currently uses LCIs from one to six to segregate material based on the
maturity of the parent program that it supports. Initially for LCI-1, when demand is
nonexistent, engineering estimates are used to develop forecasts. As the item progresses to
the next LCI categories these engineering estimates begin to factor in observed demand 
in order to develop forecasts. By the time an item is classified as an LCI-4 or -5 the 
analytical forecast is based solely on observed demand. While in general this makes 
sense, it may be possible that items could be more effectively managed and forecasted if 
they were placed into groups based on other criteria, instead of their parent programs’ life 
cycle. We propose a study to determine what these more effective sorting criteria are and 
how best to employ them. 

4. Time Periods and Fractions

The Navy currently uses five years of wholesale level demand, sorted into 20
quarterly buckets, to generate a single-number demand forecast for the 21st quarter. To
obtain a 12-month forecast, the quarterly forecast number is multiplied by four. This
single number is not always a whole number. We propose a study of the effect of using
different time buckets (days, weeks, months, etc.), different historical time periods (1, 3, 
7, etc. years) and the treatment of fractional demand forecasts (round up, round down, no 
rounding, etc.) to potentially generate more accurate forecasts. 
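
The choices named above can be made concrete with a short Python sketch. The function names, the rounding rule labels and the demand values are hypothetical; the sketch only shows how the same forecast changes under different rounding treatments and how finer buckets roll up into coarser ones.

```python
import math

# Hypothetical rounding treatments for a fractional forecast.
ROUNDING_RULES = {
    "none": lambda value: value,
    "up": math.ceil,
    "down": math.floor,
}


def annual_forecast(period_forecast, rounding="none", periods_per_year=4):
    """Scale a single-period forecast to 12 months, then apply a rounding rule."""
    return ROUNDING_RULES[rounding](period_forecast * periods_per_year)


def rebucket(demand, size):
    """Sum consecutive periods into larger buckets (e.g. 3 months into 1 quarter)."""
    if len(demand) % size != 0:
        raise ValueError("history length must be a multiple of the bucket size")
    return [sum(demand[i:i + size]) for i in range(0, len(demand), size)]


quarterly_forecast = 1.3  # hypothetical forecast for the 21st quarter
for rule in ROUNDING_RULES:
    print(rule, annual_forecast(quarterly_forecast, rule))

monthly_demand = [1, 0, 2, 0, 0, 3]  # hypothetical six months of demand
print(rebucket(monthly_demand, 3))
```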

5. Investigate the Use of Alternative Forecasting Methods

The mathematical model presented in Chapter IV aims to generate improvement
in forecast accuracy. However, it is not sufficient to select the methods with the best MASE
values throughout the entire curve while disregarding the fact that they can be worse than the
naive method. The naive method is considered a rudimentary prediction tool, yet it still
systematically outperforms the simple forecast methods used in this research for items
with CV > 2.0. While we cannot recommend its blanket utilization for those items, we
propose an investigation of the potential benefits of using either more complex
time-series forecasting methods or alternative forecasting methods such as causal, qualitative,
and expert estimates.

6. Analyze the DOD Bias Metric

The initial concerns of Congress and GAO, in dealing with the issue of excessive 
secondary inventory, seemed to be more focused on reducing the bias to over-forecast 
instead of improving forecast accuracy. While the focus today seems to have shifted 
away from bias toward accuracy, there is still a requirement to measure bias in 
forecasting. The DOD business rules that defined the accuracy metric also laid out the 
procedures for utilizing the bias metric. As we have discussed, our research centered on 
the accuracy metric, but the bias metric, as defined in Equation (2.25), could also benefit 
from a further analysis of its strengths and weaknesses. 

7. Portfolio Theory Approach

Portfolio theory indicates that an investor can optimize the trade-off between risk 
and reward through diversification. If we apply that rationale to the flexible forecasting 
model, better results are possible when the pool of forecasting methods reflects a large 
spectrum of responsiveness, and is comprised of specific methods to deal with trends, 
seasonality and intermittent demand. We propose an investigation of the benefits of 
applying a portfolio theory rationale to the flexible forecasting model. 

8. Grouping Method

In our research, we grouped items into CV clusters as an attempt to identify 
methods that are expected to outperform others for a particular range of variability. 
However, that grouping method was not able to segregate items in a way that one specific 
forecasting method outperformed the others. We acknowledge the possibility of grouping 
items in different ways, like demand patterns, clusters of unit costs, clusters of dollar 
demand, etc. However, forecasting method selection at the item level is more likely to 
produce more accurate forecasts than any other kind of grouping. Individualized forecasts 
are likely to require significantly more effort, so we propose an analysis to determine whether
this additional effort at the item level pays off in terms of marginal gains in accuracy.

9. Optimization of Parameters

Parameters used to initiate the calculations of forecast values in each of the
methods that we tested were arbitrarily chosen. The intent of our research was to uncover 
potential opportunities of improvement by applying a flexible forecasting model. We 
propose further investigation of the results generated if the parameters were optimized for 
each item. 
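
One simple way to carry out such an investigation is a per-item grid search. The Python sketch below is an assumption-laden illustration rather than a proposed design: it tunes only the smoothing constant of single exponential smoothing, scores each candidate by its one-step-ahead mean absolute error on the item's history, and uses a hypothetical grid and demand series.

```python
from statistics import mean


def ses_one_step_errors(series, alpha):
    """Absolute one-step-ahead errors of single exponential smoothing."""
    level = series[0]
    errors = []
    for y in series[1:]:
        errors.append(abs(y - level))            # forecast is the previous level
        level = alpha * y + (1 - alpha) * level  # update after observing y
    return errors


def best_alpha(series, grid=None):
    """Return the alpha in the grid with the lowest mean absolute error."""
    if grid is None:
        grid = [i / 10 for i in range(1, 10)]    # hypothetical grid 0.1 .. 0.9
    return min(grid, key=lambda a: mean(ses_one_step_errors(series, a)))


# Hypothetical item whose demand level shifts from 10 to 20 per quarter.
level_shift = [10] * 6 + [20] * 6
print("best alpha for a level shift:", best_alpha(level_shift))
```

For a series with an abrupt level change, the search favors the most responsive setting in the grid; a stable but noisy series would generally favor a smoother one, which is why a single arbitrary value is unlikely to suit every item.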


10. Apply Statistical Tools to Generalize Results

During our analysis of the DOD’s accuracy metric, we utilized quick, 
hypothetical tests to uncover evidence of inherent flaws within CIMIPf. The simplicity of 
these tests unfortunately means that the findings are not supported by any statistical 
analysis and cannot be generalized to larger datasets. Therefore, we propose statistical
analyses of the impacts of the CIMIPf flaws that we identified.